Text classification datasets

The framework is designed with BERT in mind and currently supports seven commonsense reasoning datasets (alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa). It can also be applied to other datasets with a few lines of code.

Architecture

[Figure: architecture diagram]

Dependency

conda install av -c conda-forge
pip install -r requirements.txt
pip install --editable .

# or

pip install textbook

Download raw datasets

./fetch.sh

It downloads alphanli, hellaswag, physicaliqa, socialiqa, codah, cosmosqa, and commonsenseqa from AWS into data_cache. If you want to use something-something, please download the dataset from 20bn's website.

Usage

Template

The goal of a template is to transform raw text into an intermediate datum that carries abstract information for later use.

Ideally, the template should do the following things:

  • construct text: a list of lists. The outer list is for multiple-choice situations, and each inner list holds one input pair/triplet (e.g. context, question, and choice);
  • construct label: an integer representing a zero-indexed label for the truth, or None;
  • construct token_type_id and attention: abstract representations of the segment id and attention. In the following anli example, both token_type_id and attention have three entries, one for each of the three components in each row of the text;
  • construct image: any form of image id/path you want to read later.

One example of anli is as follows:

from textbook import template_anli

# raw
case = {"story_id": "58090d3f-8a91-4c89-83ef-2b4994de9d241", "obs1": "Ron started his new job as a landscaper today.",
        "obs2": "Ron is immediately fired for insubordination.", "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
        "hyp2": "Ron's boss called him an idiot.", "label": "1"}

# target intermediate datum
target = {
    'text':
    [['Ron started his new job as a landscaper today.', "Ron ignores his bosses's orders and called him an idiot.",
        'Ron is immediately fired for insubordination.'],
        ['Ron started his new job as a landscaper today.', "Ron's boss called him an idiot.",
        'Ron is immediately fired for insubordination.']],
    'label': 0, 'image': None, 'token_type_id': [0, 1, 0],
    'attention': [1, 1, 1]}

LABEL2INT = {
    "anli": {
        "1": 0,
        "2": 1,
    },
}
assert template_anli(case, LABEL2INT['anli']) == target
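
The mapping from the raw case to the intermediate datum can be sketched as follows. This is a minimal sketch inferred only from the raw/target pair above, not the library's actual template_anli; the name sketch_template_anli is hypothetical.

```python
def sketch_template_anli(raw_datum, label2int):
    """Hypothetical re-implementation inferred from the example above."""
    # One row per hypothesis: [obs1, hypothesis, obs2].
    text = [
        [raw_datum["obs1"], raw_datum[key], raw_datum["obs2"]]
        for key in ("hyp1", "hyp2")
    ]
    # Unlabeled data yields None instead of an integer label.
    raw_label = raw_datum.get("label")
    label = None if raw_label is None else label2int[str(raw_label)]
    return {
        "text": text,
        "label": label,
        "image": None,
        # One entry per component of a row: obs1 -> 0, hypothesis -> 1, obs2 -> 0.
        "token_type_id": [0, 1, 0],
        "attention": [1, 1, 1],
    }


case = {
    "obs1": "Ron started his new job as a landscaper today.",
    "obs2": "Ron is immediately fired for insubordination.",
    "hyp1": "Ron ignores his bosses's orders and called him an idiot.",
    "hyp2": "Ron's boss called him an idiot.",
    "label": "1",
}
datum = sketch_template_anli(case, {"1": 0, "2": 1})
```

The per-segment token_type_id and attention lists stay short here; expanding them to one value per token is left to a renderer.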

Renderer

A renderer transforms your intermediate datum into a full-blown datum. Each renderer takes care of a different part of the datum. For example, renderer_text renders the text into input_id and generates all token-based attention and token_type_id, while renderer_video renders the image path into an image tensor. Renderers are passed to the dataset constructor in a list and are therefore executed sequentially.
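
Sequential rendering can be illustrated with a toy pipeline. This is a sketch under assumptions, not the library's code: toy_renderer_text, toy_renderer_count, apply_renderers, and the keys rendered and num_tokens are hypothetical, and whitespace splitting stands in for the BERT tokenizer.

```python
def toy_renderer_text(datum):
    """Hypothetical stand-in for a text renderer (whitespace tokens, no BERT)."""
    rows = []
    for row in datum["text"]:
        tokens, type_ids, attention = [], [], []
        # Expand the per-segment values to one value per token.
        for segment, type_id, attend in zip(row, datum["token_type_id"], datum["attention"]):
            pieces = segment.split()
            tokens.extend(pieces)
            type_ids.extend([type_id] * len(pieces))
            attention.extend([attend] * len(pieces))
        rows.append({"tokens": tokens, "token_type_ids": type_ids, "attentions": attention})
    return {**datum, "rendered": rows}


def toy_renderer_count(datum):
    """Second hypothetical renderer; it relies on the output of the first one."""
    return {**datum, "num_tokens": [len(r["tokens"]) for r in datum["rendered"]]}


def apply_renderers(datum, renderers):
    # Renderers run in list order, each receiving the previous result.
    for renderer in renderers:
        datum = renderer(datum)
    return datum


intermediate = {
    "text": [["a b", "c", "d e f"], ["a b", "x y", "d e f"]],
    "label": 0,
    "image": None,
    "token_type_id": [0, 1, 0],
    "attention": [1, 1, 1],
}
full = apply_renderers(intermediate, [toy_renderer_text, toy_renderer_count])
```

Because the second renderer reads a key written by the first, swapping their order in the list would fail, which is why the list order matters.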

BatchTool

We provide a BatchTool that makes it easy to apply MLM (masked language modeling) or padding. See the class documentation for more information.
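
For readers unfamiliar with the two operations, here is a rough pure-Python sketch of padding and masking. It is not BatchTool's implementation: pad_batch and mask_tokens are hypothetical names, the mask id 103 and ignore value -100 are placeholder choices, and the masking is simplified to always substitute the mask token (BERT's original scheme also keeps or randomizes some selected tokens).

```python
import random


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one; return ids and attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention


def mask_tokens(input_ids, attention, mask_id, prob=0.15, ignore=-100, seed=0):
    """Replace a random subset of real (non-padding) tokens with mask_id.

    Labels keep the original id at masked positions and `ignore` elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for ids, att in zip(input_ids, attention):
        new_ids, new_labels = [], []
        for token, real in zip(ids, att):
            if real and rng.random() < prob:
                new_ids.append(mask_id)
                new_labels.append(token)
            else:
                new_ids.append(token)
                new_labels.append(ignore)
        masked.append(new_ids)
        labels.append(new_labels)
    return masked, labels


ids, att = pad_batch([[5, 6, 7], [8, 9]])
masked, labels = mask_tokens(ids, att, mask_id=103, prob=0.5)
```

Padding positions are never masked, so the loss is computed only over real tokens that were replaced.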

Load a dataset with pandas

from transformers import BertTokenizer
from textbook import MultiModalDataset, template_anli, renderer_text, BatchTool, TokenBasedSampler
from torch.utils.data import Dataset, DataLoader
from textbook import LABEL2INT
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer)],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

Create a multitask dataset with multiple datasets

from transformers import BertTokenizer
from torch.utils.data import DataLoader
from textbook import *
import pandas as pd

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# add additional tokens for each task as special `cls_token`
tokenizer.add_special_tokens({"additional_special_tokens": [
        "[ANLI]", "[HELLASWAG]"
]})

d1 = MultiModalDataset(
    df=pd.read_json("data_cache/alphanli/train.jsonl", lines=True),
    template=lambda x: template_anli(x, LABEL2INT['anli']),
    renderers=[lambda x: renderer_text(x, tokenizer, "[ANLI]")],
)
bt1 = BatchTool(tokenizer, source="anli")
i1 = DataLoader(d1, batch_sampler=TokenBasedSampler(d1, batch_size=64), collate_fn=bt1.collate_fn)

d2 = MultiModalDataset(
    df=pd.read_json("data_cache/hellaswag/train.jsonl", lines=True),
    template=lambda x: template_hellaswag(x, LABEL2INT['hellaswag']),
    renderers=[lambda x: renderer_text(x, tokenizer, "[HELLASWAG]")],
)
bt2 = BatchTool(tokenizer, source="hellaswag")
# The sampler wraps the same dataset that the loader reads from.
i2 = DataLoader(d2, batch_sampler=TokenBasedSampler(d2, batch_size=64), collate_fn=bt2.collate_fn)

d = MultiTaskDataset([i1, i2], shuffle=False)

# The batch size must be 1 for MultiTaskDataset because each sub-dataset is already batched.
for batch in DataLoader(d, batch_size=1, collate_fn=BatchTool.uncollate_fn):
    # Each batch is a dict of the form:
    # {
    #     "source": "anli" or "hellaswag",
    #     "labels": ...,
    #     "input_ids": ...,
    #     "attentions": ...,
    #     "token_type_ids": ...,
    #     "images": ...,
    # }
    pass
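
The reason for batch_size=1 can be sketched in pure Python. The code below is an illustration under assumptions, not how MultiTaskDataset or BatchTool.uncollate_fn are implemented: interleave_batches and uncollate are hypothetical names, and the round-robin order is a guess at one possible unshuffled ordering.

```python
from itertools import zip_longest


def interleave_batches(sources):
    """Yield (source_name, batch) pairs round-robin until every source is exhausted.

    `sources` maps a task name to an iterable of already-collated batches.
    """
    names = list(sources)
    missing = object()  # sentinel for sources that ran out early
    for group in zip_longest(*(sources[name] for name in names), fillvalue=missing):
        for name, batch in zip(names, group):
            if batch is not missing:
                yield name, batch


def uncollate(items):
    # An outer loader with batch_size=1 hands the collate function a
    # one-element list; unwrap it instead of batching a second time.
    assert len(items) == 1
    return items[0]


sources = {
    "anli": [{"input_ids": [[1, 2], [3, 4]]}, {"input_ids": [[5, 6]]}],
    "hellaswag": [{"input_ids": [[7, 8, 9]]}],
}
order = [name for name, _ in interleave_batches(sources)]
batch = uncollate([{"source": "anli", "input_ids": [[1, 2], [3, 4]]}])
```

Each element of the combined dataset is already a full batch from one task, so the outer loader only needs to pass it through unchanged.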

Implement a new template or renderer

It is advised to follow the conventions below, but you are free to use any signature, since a lambda can adapt any function at the point where it is passed in.

def template_xxx(raw_datum, *args, **kwargs):
    pass

def renderer_xxx(intermediate_datum, *args, **kwargs):
    pass

For example, for the Quora Question Pairs dataset:

def template_qqp(raw_datum, label2int={"0": 0, "1": 1}):

    # A single row holding one question pair; the label is None for unlabeled data.
    result = {
        "text": [
            [raw_datum['question1'], raw_datum['question2']]
        ],
        "image": None,
        "label": None if 'is_duplicate' not in raw_datum or raw_datum['is_duplicate'] is None else label2int[str(raw_datum['is_duplicate'])],
        "token_type_id": [0, 1],
        "attention": [1, 1],
    }

    return result
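
Templates that take extra arguments can be adapted to the single-argument form used in the dataset examples with either a lambda or functools.partial. The sketch below uses a hypothetical template_toy with an invented lowercase option purely for illustration.

```python
from functools import partial


def template_toy(raw_datum, label2int, lowercase=False):
    """Hypothetical template with extra arguments beyond the raw datum."""
    text = [raw_datum["question1"], raw_datum["question2"]]
    if lowercase:
        text = [t.lower() for t in text]
    return {
        "text": [text],
        "image": None,
        "label": label2int[str(raw_datum["is_duplicate"])],
        "token_type_id": [0, 1],
        "attention": [1, 1],
    }


label2int = {"0": 0, "1": 1}

# Both wrappers accept a single raw datum and forward the fixed extra arguments.
with_lambda = lambda x: template_toy(x, label2int, lowercase=True)
with_partial = partial(template_toy, label2int=label2int, lowercase=True)

row = {"question1": "What is BERT?", "question2": "What is Bert?", "is_duplicate": 1}
```

Either wrapper can then be passed wherever a one-argument template callable is expected.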

Contact

Author: Chenghao Mou

Email: mouchenghao@gmail.com
