Integrated Corpus-Building Environment

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Expanda

The universal integrated corpus-building environment.

Introduction

Expanda is an integrated corpus-building environment that provides ready-made pipelines for building a corpus dataset. Building a corpus dataset requires several complicated steps, such as parsing, shuffling, and tokenization. When the corpora are gathered from different applications, parsing their various formats becomes a problem of its own. Expanda lets you build the whole corpus in a single step by writing a build configuration.

For more information, see the documentation.

Main Features

Easy to build, and simple to extend with new extensions
Manages the build environment systematically
Fast builds through performance optimization (even though it is written in Python)
Supports multiprocessing
Extremely low memory usage
No need to write new code for each corpus; adding a new corpus takes just one line.

Dependencies

nltk
ijson
tqdm>=4.46.0
mwparserfromhell>=0.5.4
tokenizers>=0.7.0
kss==1.3.1

Installation

With pip

Expanda can be installed using pip as follows:

$ pip install expanda

From source

You can install from source by cloning the repository and running:

$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install

Build your first dataset

Let's build a Wikipedia dataset with Expanda. First, install Expanda:

$ pip install expanda

Next, create a workspace in which to build the dataset:

$ mkdir workspace
$ cd workspace

Then, download a Wikipedia dump file from the Wikimedia dumps site (dumps.wikimedia.org). In this example, we use only part of the wiki. Download the file through the browser, move it to workspace/src, and rename it to wiki.xml.bz2. Alternatively, run the commands below:

$ mkdir src
$ wget -O src/wiki.xml.bz2 https://dumps.wikimedia.org/enwiki/20200520/enwiki-20200520-pages-articles1.xml-p1p30303.bz2
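
If wget is not available, the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name download_dump is hypothetical and is not part of Expanda.

```python
import os
import shutil
import urllib.request

# URL of the partial English Wikipedia dump used in this example.
DUMP_URL = (
    "https://dumps.wikimedia.org/enwiki/20200520/"
    "enwiki-20200520-pages-articles1.xml-p1p30303.bz2"
)


def download_dump(url, dest):
    """Stream `url` to `dest`, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so the whole dump is never held in memory.
        shutil.copyfileobj(response, out)
    return dest
```

Calling download_dump(DUMP_URL, "src/wiki.xml.bz2") from inside the workspace directory should leave the dump where the rest of this example expects it.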

After downloading the dump file, we need to set up the configuration file. Create a file named expanda.cfg with the following contents:

[expanda.ext.wikipedia]
num-cores           = 6

[tokenization]
unk-token           = <unk>
control-tokens      = <s>
                      </s>
                      <pad>

[build]
input-files         =
    --expanda.ext.wikipedia     src/wiki.xml.bz2
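
To make the layout of this file concrete, the sketch below reads it with Python's standard configparser module. It only illustrates the INI structure, under the assumption that each indented line of input-files pairs an extension name (prefixed with --) with a file path. It is not Expanda's own configuration loader, and the helper name read_input_files is hypothetical.

```python
import configparser

EXAMPLE_CFG = """\
[expanda.ext.wikipedia]
num-cores           = 6

[tokenization]
unk-token           = <unk>
control-tokens      = <s>
                      </s>
                      <pad>

[build]
input-files         =
    --expanda.ext.wikipedia     src/wiki.xml.bz2
"""


def read_input_files(config):
    """Return (extension, path) pairs from the [build] input-files option.

    Assumption: every non-empty line has the form `--<extension> <path>`.
    """
    pairs = []
    for line in config["build"]["input-files"].splitlines():
        if not line.strip():
            continue  # the first line of the value is empty
        extension, path = line.split(maxsplit=1)
        pairs.append((extension.lstrip("-"), path.strip()))
    return pairs


config = configparser.ConfigParser()
config.read_string(EXAMPLE_CFG)

# Indented continuation lines are joined into one multi-line value.
num_cores = config["expanda.ext.wikipedia"].getint("num-cores")
control_tokens = config["tokenization"]["control-tokens"].split()
input_files = read_input_files(config)

print(num_cores)
print(control_tokens)
print(input_files)
```

Read this way, each extension gets its own section for options such as num-cores, while the [build] section lists which extension handles which input file.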

The workspace directory should now have the following structure:

workspace
├── src
│   └── wiki.xml.bz2
└── expanda.cfg

Now we are ready to build! Run Expanda by using:

$ expanda build

The build then prints output like the following:

[*] execute extension [expanda.ext.wikipedia] for [src/wiki.xml.bz2]
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[*] merge extracted texts.
[*] start shuffling merged corpus...
[*] optimum stride: 17, buckets: 34
[*] create temporary bucket files.
[*] successfully shuffle offsets. total offsets: 102936
[*] shuffle input file: 100%|████████████████████| 102936/102936 [00:02<00:00, 34652.03it/s]
[*] start copying buckets to the output file.
[*] finish copying buckets. remove the buckets...
[*] complete preparing corpus. start training tokenizer...
[00:00:59] Reading files                            ████████████████████                 100
[00:00:04] Tokenize words                           ████████████████████ 405802   /   405802
[00:00:00] Count pairs                              ████████████████████ 405802   /   405802
[00:00:01] Compute merges                           ████████████████████ 6332     /     6332

[*] create tokenized corpus.
[*] tokenize corpus: 100%|█████████████████████| 1749902/1749902 [00:28<00:00, 61958.55it/s]
[*] split the corpus into train and test dataset.
[*] remove temporary directory.
[*] finish building corpus.
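
The log mentions temporary bucket files during shuffling. One general way to shuffle a text file without loading it all into memory is a two-pass bucket shuffle, sketched below. This is a generic illustration of the idea and not Expanda's actual implementation; the function name shuffle_file and its parameters are hypothetical.

```python
import os
import random
import tempfile


def shuffle_file(input_path, output_path, num_buckets=4, seed=None):
    """Shuffle the lines of a text file using temporary bucket files.

    Only one bucket is held in memory at a time, so peak memory is roughly
    the size of the largest bucket rather than the size of the whole file.
    """
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = [os.path.join(tmp_dir, f"bucket{i}") for i in range(num_buckets)]
        buckets = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            # Pass 1: scatter every line into a randomly chosen bucket.
            with open(input_path, encoding="utf-8") as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"  # normalize a missing final newline
                    rng.choice(buckets).write(line)
        finally:
            for bucket in buckets:
                bucket.close()

        # Pass 2: shuffle each bucket in memory and append it to the output.
        with open(output_path, "w", encoding="utf-8") as dst:
            for path in paths:
                with open(path, encoding="utf-8") as bucket:
                    lines = bucket.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)
```

Raising num_buckets lowers peak memory at the cost of more open temporary files, which is the kind of trade-off a tuned bucket count would balance.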

If the build succeeds, you get the following directory tree:

workspace
├── build
│   ├── corpus.raw.txt
│   ├── corpus.train.txt
│   ├── corpus.test.txt
│   └── vocab.txt
├── src
│   └── wiki.xml.bz2
└── expanda.cfg
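
A quick sanity check on the result is to count the lines in each output file. The sketch below assumes the files are plain text with one entry per line, which this page does not state explicitly; the helper names are hypothetical and not part of Expanda.

```python
import os


def count_lines(path):
    """Count lines by streaming the file, so large corpora stay out of memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def summarize_build(build_dir="build"):
    """Return {file name: line count} for the output files that exist."""
    names = ["corpus.raw.txt", "corpus.train.txt", "corpus.test.txt", "vocab.txt"]
    summary = {}
    for name in names:
        path = os.path.join(build_dir, name)
        if os.path.exists(path):
            summary[name] = count_lines(path)
    return summary


if __name__ == "__main__":
    # Run from the workspace directory after `expanda build` has finished.
    for name, lines in summarize_build().items():
        print(f"{name}: {lines} lines")
```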

Project details

Release history

Version	Release date
1.3.1 (this version)	Jun 27, 2020
1.3.0	Jun 27, 2020
1.2.2	Jun 27, 2020
1.2.1	Jun 26, 2020
1.2.0	Jun 26, 2020
1.1.5	May 31, 2020
1.1.4	May 31, 2020
1.1.3	May 27, 2020
1.1.2	May 27, 2020
1.1.1	May 26, 2020
1.1.0	May 26, 2020
1.0.2	May 26, 2020
1.0.1	May 25, 2020
1.0.0	May 25, 2020

Download files

Source Distribution

Expanda-1.3.1.tar.gz (13.3 kB)

Uploaded Jun 27, 2020 Source

Built Distribution

Expanda-1.3.1-py3-none-any.whl (21.2 kB)

Uploaded Jun 27, 2020 Python 3

Hashes for Expanda-1.3.1.tar.gz
Algorithm	Hash digest
SHA256	`a0f0a83f997c3243b90318822997d3fbbead272a7765e239cab437f15c88a8f7`
MD5	`0376e14eaabc6c3bc57ae96299daa343`
BLAKE2b-256	`d13f37d91da7db21350d7e4885070c746819f0c6500d52bbcdf64d31a5d86eda`

Hashes for Expanda-1.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`89017e481b3079b520c3f129e9c7ec0bfcbc15ce5743c11f826f7f1829117629`
MD5	`eaeab1f6d83e99105f1b9a869998fe5f`
BLAKE2b-256	`a0c1e39eaceebeb3dd8639b5bc9d08af318e2674b3fd2ebfab430c875d0a9206`
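
The digests above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard hashlib module; the helper names sha256_of and verify are hypothetical, and the expected value is the SHA256 digest listed above for the source distribution.

```python
import hashlib

# SHA256 digest listed above for Expanda-1.3.1.tar.gz.
EXPECTED_SHA256 = "a0f0a83f997c3243b90318822997d3fbbead272a7765e239cab437f15c88a8f7"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest matches `expected`."""
    return sha256_of(path) == expected.lower()
```

For example, verify("Expanda-1.3.1.tar.gz") should return True only when the downloaded archive matches the listed digest; pass the wheel's digest as the second argument to check the built distribution instead.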