Skip to main content

HuggingFace community-driven open-source library for dataset disaggregation

Project description


Hugging Face Disaggregators

GitHub GitHub release

⚠️ Please note: This library is in early development, and the disaggregation modules that are included are proofs of concept that are not production-ready. Additionally, all APIs are subject to breaking changes any time before a 1.0.0 release. Rigorously tested versions of the included modules will be released in the future, so stay tuned. We'd love your feedback in the meantime!

The disaggregators library allows you to easily add new features to your datasets to enable disaggregated data exploration and disaggregated model evaluation. disaggregators is preloaded with disaggregation modules for text data, with image modules coming soon!

This library is intended to be used with 🤗 Datasets, but should work with any other "mappable" interface to a dataset.

Requirements and Installation

disaggregators has been tested on Python 3.8, 3.9, and 3.10.

pip install disaggregators will fetch the latest release from PyPI.

Note that some disaggregation modules require extra dependencies such as SpaCy modules, which may need to be installed manually. If these dependencies aren't installed, disaggregators will inform you about how to install them.

To install directly from this GitHub repo, use the following command:

pip install git+https://github.com/huggingface/disaggregators.git

Usage

You will likely want to use 🤗 Datasets with disaggregators.

pip install datasets

The snippet below loads the IMDB dataset from the Hugging Face Hub, and initializes a disaggregator for "pronoun" that will run on the IMDB dataset's "text" column. If you would like to run multiple disaggregations, you can pass a list to the Disaggregator constructor (e.g. Disaggregator(["pronoun", "sentiment"], column="text")). We then use the 🤗 Datasets map method to apply the disaggregation to the dataset.

from disaggregators import Disaggregator
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
disaggregator = Disaggregator("pronoun", column="text")

ds = dataset.map(disaggregator)  # New boolean columns are added for she/her, he/him, and they/them

The resulting dataset can now be used for data exploration and disaggregated model evaluation.

You can also run disaggregations on Pandas DataFrames with .apply and .merge:

from disaggregators import Disaggregator
import pandas as pd
df = pd.DataFrame({"text": ["They went to the park."]})

disaggregator = Disaggregator("pronoun", column="text")

new_cols = df.apply(disaggregator, axis=1)
df = pd.merge(df, pd.json_normalize(new_cols), left_index=True, right_index=True)

Available Disaggregation Modules

The following modules are currently available:

  • "age"
  • "gender"
  • "pronoun"
  • "religion"
  • "continent"

Note that disaggregators is in active development, and that these (and future) modules are subject to changing interfaces and implementations at any time before a 1.0.0 release. Each module provides its own method for overriding the default configuration, with the general interface documented below.

Module Configurations

Modules may make certain variables and functionality configurable. If you'd like to configure a module, import the module, its labels, and its config class. Then, override the labels and set the configuration as needed while instantiating the module. Once instantiated, you can pass the module to the Disaggregator. The example below shows this with the Age module.

from disaggregators import Disaggregator
from disaggregators.disaggregation_modules.age import Age, AgeLabels, AgeConfig

class MeSHAgeLabels(AgeLabels):
    INFANT = "infant"
    CHILD_PRESCHOOL = "child_preschool"
    CHILD = "child"
    ADOLESCENT = "adolescent"
    ADULT = "adult"
    MIDDLE_AGED = "middle_aged"
    AGED = "aged"
    AGED_80_OVER = "aged_80_over"

age = Age(
    config=AgeConfig(
        labels=MeSHAgeLabels,
        ages=[list(MeSHAgeLabels)],
        breakpoints=[0, 2, 5, 12, 18, 44, 64, 79]
    ),
    column="question"
)

disaggregator = Disaggregator([age, "gender"], column="question")

Custom Modules

Custom modules can be created by extending the CustomDisaggregator. All custom modules must have labels and a module_id, and must implement a __call__ method.

from disaggregators import Disaggregator, DisaggregationModuleLabels, CustomDisaggregator

class TabsSpacesLabels(DisaggregationModuleLabels):
    TABS = "tabs"
    SPACES = "spaces"

class TabsSpaces(CustomDisaggregator):
    module_id = "tabs_spaces"
    labels = TabsSpacesLabels

    def __call__(self, row, *args, **kwargs):
        if "\t" in row[self.column]:
            return {self.labels.TABS: True, self.labels.SPACES: False}
        else:
            return {self.labels.TABS: False, self.labels.SPACES: True}

disaggregator = Disaggregator(TabsSpaces, column="text")

Development

Development requirements can be installed with pip install .[dev]. See the Makefile for useful targets, such as code quality and test running.

To run tests locally across multiple Python versions (3.8, 3.9, and 3.10), ensure that you have all the Python versions available and then run nox -r. Note that this is quite slow, so it's only worth doing to double-check your code before you open a Pull Request.

Contact

Nima Boscarino – nima <at> huggingface <dot> co

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

disaggregators-0.1.2.tar.gz (17.6 kB view hashes)

Uploaded Source

Built Distribution

disaggregators-0.1.2-py3-none-any.whl (16.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page