Scikit-learn compatible Japanese text vectorizer for CPU-based Japanese natural language processing.


sumire

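A Scikit-learn compatible Japanese word segmenter and text vectorization tool for CPU-based Japanese natural language processing, usable without pre-installing a morphological analyzer or similar tools.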


Installation

Prerequisites

Tested OS: Ubuntu 22.04.
Python >= 3.9
make
cmake
git

# Jumanpp dependencies.
sudo apt update -y;
sudo apt install -y cmake libeigen3-dev libprotobuf-dev protobuf-c-compiler;

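With just pip install sumire, you can use both MeCab and Jumanpp without installing them yourself. If the MeCab or Jumanpp executables or their dictionaries are missing, they are installed into $HOME/.local/sumire/ when a Tokenizer is instantiated.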
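To check whether that first-run installation has already happened, one can look at the install directory. The following is a minimal sketch that assumes only the $HOME/.local/sumire/ location named above; the helper names are hypothetical and not part of sumire, and nothing is assumed about the directory's internal layout.

```python
from pathlib import Path


def sumire_home() -> Path:
    """Return the directory this README names as the auto-install location."""
    return Path.home() / ".local" / "sumire"


def describe_install(root: Path) -> str:
    """Summarise whether anything has been placed under `root` yet."""
    if not root.is_dir():
        return f"{root} does not exist; nothing has been installed yet."
    # List only the top-level entries; their meaning is not interpreted here.
    entries = sorted(p.name for p in root.iterdir())
    listing = ", ".join(entries) or "(empty)"
    return f"{root} contains {len(entries)} entries: {listing}"


if __name__ == "__main__":
    print(describe_install(sumire_home()))
```

Running the script before and after instantiating a tokenizer for the first time should show the directory being populated.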

Usage

Tokenizer usage

from sumire.tokenizer import MecabTokenizer, JumanppTokenizer


text = "これはテスト文です。"
texts = ["これはテスト文です。", "別のテキストもトークン化します。"]

# tokenize() accepts a single string or a list of strings.
mecab = MecabTokenizer("unidic-lite")
text_mecab_tokenized = mecab.tokenize(text)
texts_mecab_tokenized = mecab.tokenize(texts)

jumanpp = JumanppTokenizer()
text_jumanpp_tokenized = jumanpp.tokenize(text)
texts_jumanpp_tokenized = jumanpp.tokenize(texts)
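Both tokenizers are called with either a single string or a list of strings. The sketch below illustrates that calling convention with a stand-in class. `CharTokenizer` is hypothetical, splits text into characters rather than morphemes, and its return shapes (one token list for a string, a list of token lists for a list) are an assumption, not sumire's documented behaviour.

```python
from typing import List, Union


class CharTokenizer:
    """Hypothetical stand-in: splits text into characters, not morphemes."""

    def tokenize(
        self, inputs: Union[str, List[str]]
    ) -> Union[List[str], List[List[str]]]:
        # A single string yields one token list.
        if isinstance(inputs, str):
            return list(inputs)
        # A list of strings yields one token list per string, in order.
        return [list(text) for text in inputs]


tokenizer = CharTokenizer()
single = tokenizer.tokenize("テスト")
batch = tokenizer.tokenize(["テスト", "文"])
```

Accepting both shapes in one method lets the same call be used for a quick interactive check and for a whole dataset.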

Vectorizer usage

from sumire.tokenizer.mecab import MecabTokenizer
from sumire.vectorizer.count import CountVectorizer
from sumire.vectorizer.swem import W2VSWEMVectorizer
from sumire.vectorizer.transformer_emb import TransformerEmbeddingVectorizer

texts = ["これはテスト文です。", "別のテキストもトークン化します。"]

count_vectorizer = CountVectorizer()  # uses MecabTokenizer() by default
swem_vectorizer = W2VSWEMVectorizer()
bert_cls_vectorizer = TransformerEmbeddingVectorizer()

# Fit and transform in one step; .fit() and .transform() can also be called separately.
count_vectorized = count_vectorizer.fit_transform(texts)
swem_vectorized = swem_vectorizer.fit_transform(texts)
bert_cls_vectorized = bert_cls_vectorizer.fit_transform(texts)

# save and load vectorizer.
count_vectorizer.save_pretrained("path/to/count_vectorizer")
count_vectorizer = CountVectorizer.from_pretrained("path/to/count_vectorizer")
swem_vectorizer.save_pretrained("path/to/swem_vectorizer")
swem_vectorizer = W2VSWEMVectorizer.from_pretrained("path/to/swem_vectorizer")
bert_cls_vectorizer.save_pretrained("path/to/bert_cls_vectorizer")
bert_cls_vectorizer = TransformerEmbeddingVectorizer.from_pretrained("path/to/bert_cls_vectorizer")
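As background for W2VSWEMVectorizer: SWEM (simple word-embedding-based model) builds a sentence vector by pooling the word vectors of its tokens. The sketch below shows average and max pooling over a toy embedding table in plain Python. The table, the tokens, and the function names are made up for illustration; which pooling sumire applies by default, and how it treats out-of-vocabulary words, is not covered by this README, so skipping unknown tokens here is an assumption.

```python
from typing import Dict, List, Sequence

# Toy 3-dimensional embeddings; real word2vec vectors are far larger.
EMBEDDINGS: Dict[str, List[float]] = {
    "これ": [1.0, 0.0, 2.0],
    "テスト": [3.0, 4.0, 0.0],
    "文": [2.0, 2.0, 1.0],
}


def lookup(tokens: Sequence[str]) -> List[List[float]]:
    """Collect vectors for known tokens; unknown tokens are skipped (an assumption)."""
    return [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]


def swem_mean(tokens: Sequence[str]) -> List[float]:
    """Average pooling: element-wise mean of the token vectors."""
    vectors = lookup(tokens)
    if not vectors:
        raise ValueError("no known tokens to pool")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def swem_max(tokens: Sequence[str]) -> List[float]:
    """Max pooling: element-wise maximum of the token vectors."""
    vectors = lookup(tokens)
    if not vectors:
        raise ValueError("no known tokens to pool")
    return [max(dim) for dim in zip(*vectors)]


tokens = ["これ", "は", "テスト", "文"]  # "は" is not in the toy table
mean_vector = swem_mean(tokens)  # [2.0, 2.0, 1.0]
max_vector = swem_max(tokens)    # [3.0, 4.0, 2.0]
```

Because pooling needs only a table lookup and element-wise arithmetic, this kind of sentence vector is cheap to compute on a CPU.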

For detailed documentation of each tokenizer and sentence-embedding module, see the documentation page. For information on Transformers and gensim models that have been verified to work, see /sumire/resources/model_card.

Development background

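With the rise of LLMs, practical Japanese NLP tasks such as search, sentiment analysis, and other text classification and regression are attracting growing attention. In these basic tasks, segmenting Japanese text into words and obtaining distributed representations of words and sentences are among the most fundamental operations. In the LLM era, pretrained Transformer models such as BERT and Embeddings from the OpenAI API are, needless to say, the most important techniques for text representation, and they are reasonably easy to implement.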

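In practice, however, state-of-the-art methods that need expensive GPUs, such as BERT, or APIs that charge per query, such as the OpenAI API, are not light in either computation or operating cost. In addition, during proof-of-concept (PoC) work at the early stage of a project, such as dataset construction, the ~~tedious~~ installation of dictionary data and morphological analyzers, and looking up the slightly different methods and properties of each API as you go, takes some effort.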

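With these points in mind, we developed a library that, like Scikit-learn, offers a unified API for each kind of functionality and returns a variety of sentence representations as soon as you give it text, so that you can quickly focus on the modeling and analysis parts of a PoC in a local environment that may not have a GPU.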

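Unmotivated development tasks (at this moment)

Using OpenAI embedding models. (Expensive.)
Implementing tuning features that require a GPU for embeddings from pretrained Transformer models. (No GPU at hand.)
Greatly reducing the readability of the library internals for the sake of execution speed.
- In small-scale PoCs, we consider implementation speed more important than execution speed.
- If speed or disk usage becomes a problem in large-scale operation after the PoC, we want to keep the code readable so that developers can easily remove unneeded processing from this library or customize it.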

Roadmap (motivated development tasks)

decode() for vectorizer inputs.
Verifying operation on Google Colab.

Coding rule

https://pep8-ja.readthedocs.io/ja/latest/

License

sumire is distributed under the terms of the MIT License.

Acknowledgements (Dependent libraries, data, and models.)

See dependent_licenses.csv.


Release history

1.0.2 (Jan 31, 2024), this version
1.0.1 (Dec 16, 2023)
