Export UNIHAN data of Chinese, Japanese, Korean to CSV, JSON or YAML

unihan-etl

An ETL tool for the Unicode Han Unification (UNIHAN) database releases. unihan-etl is designed to fetch (download), unpack (unzip), and convert the database from the Unicode website into either a flattened, tabular format or a structured, hierarchical format.

unihan-etl serves dual purposes: as a Python library offering an API for accessing data as Python objects, and as a command-line interface (CLI) for exporting data into CSV, JSON, or YAML formats.

This tool is a component of the cihai suite of CJK-related projects. For a similar tool, see libUnihan.

As of v0.31.0, unihan-etl is compatible with UNIHAN Version 15.1.0 (released on 2023-09-01, revision 35).

The UNIHAN database

The UNIHAN database organizes data across multiple files, exemplified below:

U+3400	kCantonese		jau1
U+3400	kDefinition		(same as U+4E18 丘) hillock or mound
U+3400	kMandarin		qiū
U+3401	kCantonese		tim2
U+3401	kDefinition		to lick; to taste, a mat, bamboo bark
U+3401	kHanyuPinyin		10019.020:tiàn
U+3401	kMandarin		tiàn
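
The line layout above can be grouped into one record per codepoint with a few lines of Python. This is a minimal sketch, not unihan-etl's own loader: `load_rows` is a hypothetical helper, and it assumes every data line holds a codepoint, a field name, and a value separated by whitespace (tabs in the data files), with `#` marking comment lines.

```python
from typing import Dict, Iterable


def load_rows(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group UNIHAN lines into one flat record per codepoint (hypothetical helper)."""
    rows: Dict[str, Dict[str, str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on whitespace at most twice so values that contain
        # spaces (e.g. kDefinition) stay in one piece.
        ucn, field, value = line.split(None, 2)
        # "U+3400" -> the character itself, via its hexadecimal codepoint.
        row = rows.setdefault(ucn, {"ucn": ucn, "char": chr(int(ucn[2:], 16))})
        row[field] = value
    return rows


sample = [
    "U+3400\tkCantonese\tjau1",
    "U+3400\tkMandarin\tqiū",
    "U+3401\tkDefinition\tto lick; to taste, a mat, bamboo bark",
]
rows = load_rows(sample)
print(rows["U+3401"]["kDefinition"])
```

Each record keeps the raw string for every field, which is the shape the flat output formats use.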

Values vary in shape and structure depending on their field type. kHanyuPinyin maps Unicode codepoints to locations in the Hànyǔ Dà Zìdiǎn together with their pinyin readings; 10019.020:tiàn is one such entry. A value can also carry several locations or several readings:

U+5EFE	kHanyuPinyin		10513.110,10514.010,10514.020:gǒng
U+5364	kHanyuPinyin		10093.130:xī,lǔ 74609.020:lǔ,xī

kHanyuPinyin supports multiple entries delimited by spaces. Within an entry, ":" (colon) separates locations in the work from pinyin readings, and "," (comma) separates multiple locations or readings. This is just one of 90 fields contained in the database.
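
The delimiter rules above can be sketched as a small parser. The function names are hypothetical and this is not unihan-etl's implementation. The digit layout of a location (one volume digit, then the page, then two character digits and one "virtual" digit after the dot) is an assumption inferred from the structured output shown later in this document.

```python
from typing import Any, Dict, List


def parse_location(location: str) -> Dict[str, int]:
    """Split a location such as '10019.020' into numeric parts (assumed layout)."""
    position, suffix = location.split(".")
    return {
        "volume": int(position[0]),    # first digit before the dot
        "page": int(position[1:]),     # remaining digits before the dot
        "character": int(suffix[:2]),  # first two digits after the dot
        "virtual": int(suffix[2]),     # last digit after the dot
    }


def parse_khanyupinyin(value: str) -> List[Dict[str, Any]]:
    """Parse a raw kHanyuPinyin value into entries of locations and readings."""
    entries = []
    for entry in value.split(" "):             # spaces separate entries
        locations, readings = entry.split(":")  # colon: locations vs. readings
        entries.append(
            {
                "locations": [parse_location(loc) for loc in locations.split(",")],
                "readings": readings.split(","),
            }
        )
    return entries


for raw in ("10019.020:tiàn", "10093.130:xī,lǔ 74609.020:lǔ,xī"):
    print(parse_khanyupinyin(raw))
```

Splitting on spaces first, then on the colon, then on commas mirrors the nesting of the format: entries contain a location list and a reading list.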

Tabular, "Flat" output

CSV (default)

$ unihan-etl

char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin
㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū
㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn

To preview in the CLI, try tabview or csvlens.
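
The CSV export can also be read from Python with the standard library. The sketch below parses the sample rows shown above from an in-memory string; with a real export you would open the file instead (for example with `open(path, newline="", encoding="utf-8")`, where `path` is wherever your export was written).

```python
import csv
import io

# Sample rows copied from the CSV output above.
sample_csv = (
    "char,ucn,kCantonese,kDefinition,kHanyuPinyin,kMandarin\n"
    "㐀,U+3400,jau1,(same as U+4E18 丘) hillock or mound,,qiū\n"
    '㐁,U+3401,tim2,"to lick; to taste, a mat, bamboo bark",10019.020:tiàn,tiàn\n'
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    # Missing fields arrive as empty strings in CSV; normalize them to None.
    pinyin = row["kHanyuPinyin"] or None
    print(row["char"], row["kMandarin"], pinyin)
```

Note that the quoted kDefinition cell keeps its embedded commas, and that an absent value is an empty string rather than null.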

JSON

$ unihan-etl -F json --no-expand

[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": "(same as U+4E18 丘) hillock or mound",
    "kCantonese": "jau1",
    "kHanyuPinyin": null,
    "kMandarin": "qiū"
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kCantonese": "tim2",
    "kHanyuPinyin": "10019.020:tiàn",
    "kMandarin": "tiàn"
  }
]

Tools:

View in CLI: python-fx, jless or fx.
Filter via CLI: jq, jql, gojq.
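
The same kind of filtering can be done in Python with the `json` module. This is a small sketch over an abbreviated copy of the sample records above, not part of unihan-etl itself.

```python
import json

# Abbreviated copy of the flat JSON output shown above.
sample_json = """
[
  {"char": "㐀", "ucn": "U+3400", "kCantonese": "jau1",
   "kHanyuPinyin": null, "kMandarin": "qiū"},
  {"char": "㐁", "ucn": "U+3401", "kCantonese": "tim2",
   "kHanyuPinyin": "10019.020:tiàn", "kMandarin": "tiàn"}
]
"""

records = json.loads(sample_json)

# Keep only the codepoints that have a kHanyuPinyin value (JSON null -> None).
with_pinyin = [r["ucn"] for r in records if r["kHanyuPinyin"] is not None]

# Index records by codepoint for direct lookup.
by_ucn = {r["ucn"]: r for r in records}

print(with_pinyin)
print(by_ucn["U+3400"]["kMandarin"])
```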

YAML

$ unihan-etl -F yaml --no-expand

- char: 㐀
  kCantonese: jau1
  kDefinition: (same as U+4E18 丘) hillock or mound
  kHanyuPinyin: null
  kMandarin: qiū
  ucn: U+3400
- char: 㐁
  kCantonese: tim2
  kDefinition: to lick; to taste, a mat, bamboo bark
  kHanyuPinyin: 10019.020:tiàn
  kMandarin: tiàn
  ucn: U+3401

Filter via the CLI with yq.

"Structured" output

A codepoint's fields can pack much more detail than a flat string shows. unihan-etl extracts these values in a uniform manner. Empty values are pruned.

To make this possible, unihan-etl exports to JSON, YAML, and Python lists/dicts.

Why not CSV?

Unfortunately, CSV is only suitable for storing table-like information. File formats such as JSON and YAML support key-value pairs and hierarchical entries.

JSON

$ unihan-etl -F json

[
  {
    "char": "㐀",
    "ucn": "U+3400",
    "kDefinition": ["(same as U+4E18 丘) hillock or mound"],
    "kCantonese": ["jau1"],
    "kMandarin": {
      "zh-Hans": "qiū",
      "zh-Hant": "qiū"
    }
  },
  {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": ["to lick", "to taste, a mat, bamboo bark"],
    "kCantonese": ["tim2"],
    "kHanyuPinyin": [
      {
        "locations": [
          {
            "volume": 1,
            "page": 19,
            "character": 2,
            "virtual": 0
          }
        ],
        "readings": ["tiàn"]
      }
    ],
    "kMandarin": {
      "zh-Hans": "tiàn",
      "zh-Hant": "tiàn"
    }
  }
]

YAML

$ unihan-etl -F yaml

- char: 㐀
  kCantonese:
    - jau1
  kDefinition:
    - (same as U+4E18 丘) hillock or mound
  kMandarin:
    zh-Hans: qiū
    zh-Hant: qiū
  ucn: U+3400
- char: 㐁
  kCantonese:
    - tim2
  kDefinition:
    - to lick
    - to taste, a mat, bamboo bark
  kHanyuPinyin:
    - locations:
        - character: 2
          page: 19
          virtual: 0
          volume: 1
      readings:
        - tiàn
  kMandarin:
    zh-Hans: tiàn
    zh-Hant: tiàn
  ucn: U+3401
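
The difference between the flat and structured records can be illustrated with a rough sketch of how two fields might be expanded and how empty values might be pruned. All three function names are hypothetical; unihan-etl's real expansion code covers many more fields and may differ in detail. The handling of a second kMandarin reading is an assumption based on the Unihan field description, since the samples here only show single readings.

```python
from typing import Any, Dict, List


def expand_definition(value: str) -> List[str]:
    """Split a kDefinition string on ';' into separate senses."""
    return [part.strip() for part in value.split(";")]


def expand_mandarin(value: str) -> Dict[str, str]:
    """Map a kMandarin value to zh-Hans / zh-Hant readings."""
    readings = value.split(" ")
    hans = readings[0]
    # Assumption: a second reading, if present, is the zh-Hant one;
    # otherwise the single reading applies to both.
    hant = readings[1] if len(readings) > 1 else readings[0]
    return {"zh-Hans": hans, "zh-Hant": hant}


def prune_empty(record: Dict[str, Any]) -> Dict[str, Any]:
    """Drop keys whose value is None, as in the structured output."""
    return {key: value for key, value in record.items() if value is not None}


flat = {
    "char": "㐁",
    "ucn": "U+3401",
    "kDefinition": "to lick; to taste, a mat, bamboo bark",
    "kMandarin": "tiàn",
    "kHanyuPinyin": None,  # pretend this field is missing
}
structured = prune_empty(
    {
        **flat,
        "kDefinition": expand_definition(flat["kDefinition"]),
        "kMandarin": expand_mandarin(flat["kMandarin"]),
    }
)
print(structured)
```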

Features

automatically downloads UNIHAN from the internet
strives for accuracy with the specifications described in UNIHAN's database design
export to JSON, CSV and YAML (requires pyyaml) via -F
configurable to export specific fields via -f
accounts for encoding conflicts due to the Unicode-heavy content
designed as a technical proof for future CJK (Chinese, Japanese, Korean) datasets
core component and dependency of cihai, a CJK library
data package support
expansion of multi-value delimited fields in YAML, JSON and Python dictionaries
supports Python >= 3.7 and PyPy

If you encounter a problem or have a question, please create an issue.

Installation

To download and build your own UNIHAN export:

$ pip install --user unihan-etl

or by pipx:

$ pipx install unihan-etl

Developmental releases

pip:

$ pip install --user --upgrade --pre unihan-etl

pipx:

$ pipx install --suffix=@next 'unihan-etl' --pip-args '\--pre' --force
# The suffixed command is then available as: unihan-etl@next

Usage

unihan-etl offers customizable builds via its command line arguments.

See unihan-etl CLI arguments for information on how you can specify columns, files, download URLs, and output destination.

To output CSV, the default format:

$ unihan-etl

To output JSON:

$ unihan-etl -F json

To output YAML:

$ pip install --user pyyaml
$ unihan-etl -F yaml

To output only the kDefinition field in a CSV:

$ unihan-etl -f kDefinition

To output multiple fields, separate with spaces:

$ unihan-etl -f kCantonese kDefinition

To output to a custom file:

$ unihan-etl --destination ./exported.csv

To output to a custom file (templated file extension):

$ unihan-etl --destination ./exported.{ext}
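
As an illustration of the templated extension, the sketch below fills an `{ext}` placeholder with a format name. It assumes the placeholder is replaced with the value passed to -F; `resolve_destination` is a hypothetical helper, and the CLI's actual substitution logic may differ.

```python
def resolve_destination(template: str, fmt: str) -> str:
    """Fill the '{ext}' placeholder with the chosen output format (assumed behavior)."""
    return template.format(ext=fmt)


for fmt in ("csv", "json", "yaml"):
    print(resolve_destination("./exported.{ext}", fmt))
```

A destination without the placeholder is returned unchanged, so a fixed file name keeps working.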

See unihan-etl CLI arguments for advanced usage examples.

Code layout

# cache dir (Unihan.zip is downloaded, contents extracted)
{XDG cache dir}/unihan_etl/

# output dir
{XDG data dir}/unihan_etl/
  unihan.json
  unihan.csv
  unihan.yaml   # (requires pyyaml)

# package dir
unihan_etl/
  core.py       # argparse, download, extract, transform UNIHAN's data
  options.py    # configuration object
  constants.py  # immutable data vars (field to filename mappings, etc)
  expansion.py  # extracting details baked inside of fields
  types.py      # type annotations
  util.py       # utility / helper functions

# test suite
tests/*
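
To locate the cache and output directories from a script, the XDG placeholders above can be resolved roughly as follows. This sketch assumes the Linux defaults from the XDG Base Directory Specification (`~/.cache` and `~/.local/share`); `xdg_dir` is a hypothetical helper, and unihan-etl may resolve these paths differently, particularly on macOS and Windows.

```python
import os
from pathlib import Path
from typing import Mapping


def xdg_dir(env: Mapping[str, str], variable: str, fallback: str) -> Path:
    """Return the unihan_etl directory under an XDG base dir (assumed layout)."""
    # Use the environment variable when set and non-empty,
    # otherwise fall back to the spec's default under the home directory.
    base = env.get(variable) or str(Path.home() / fallback)
    return Path(base) / "unihan_etl"


cache_dir = xdg_dir(os.environ, "XDG_CACHE_HOME", ".cache")
data_dir = xdg_dir(os.environ, "XDG_DATA_HOME", ".local/share")
print(cache_dir)
print(data_dir / "unihan.json")
```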

API

The package is Python under the hood, so you can use its full API. Example:

>>> from unihan_etl.core import Packager
>>> pkgr = Packager()
>>> hasattr(pkgr.options, 'destination')
True

Developing

$ git clone https://github.com/cihai/unihan-etl.git

$ cd unihan-etl

Bootstrap your environment and learn more about contributing. We use the same conventions / tools across all cihai projects: pytest, sphinx, mypy, ruff, tmuxp, and file watcher helpers (e.g. entr(1)).
