SpeakLeash agnostic dataset for Polish
Project description
SpeakLeash is a lightweight library providing datasets for the Polish language and tools to make them useful.
- Website: https://speakleash.org/
- Datasets: https://speakleash.org/dashboard/
- Source code: https://github.com/speakleash/speakleash
- Data in action: https://github.com/speakleash/speakleash-examples
- Bug reports: https://github.com/speakleash/speakleash/issues
Installation
The speakleash package is available from PyPI and has to be installed in a virtual environment:
pip install speakleash
Basic Usage
If you just want to see the details of the available datasets:
from speakleash import Speakleash
import os

# Directory where SpeakLeash stores the dataset files (next to this script).
base_dir = os.path.dirname(os.path.abspath(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

# Print a one-line summary per dataset; no document text is iterated here.
for d in sl.datasets:
    size_mb = round(d.characters / 1024 / 1024)
    print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))
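When the list of datasets is long, it can help to rank them before printing. The following is a minimal sketch, not part of the library: `summarize` is a hypothetical helper, and it only assumes that each dataset object exposes the `name`, `characters` and `documents` properties used above. The stand-in objects and their numbers are made up so the snippet runs without downloading anything.

```python
from types import SimpleNamespace


def summarize(datasets, top=3):
    """Return summary lines for the `top` largest datasets by character count."""
    ranked = sorted(datasets, key=lambda d: d.characters, reverse=True)
    return [
        "{0}: {1:,} characters, {2:,} documents".format(d.name, d.characters, d.documents)
        for d in ranked[:top]
    ]


# Stand-in objects with made-up numbers; with the library installed you
# would pass sl.datasets instead.
fake = [
    SimpleNamespace(name="small", characters=1_000, documents=10),
    SimpleNamespace(name="large", characters=5_000_000, documents=2_000),
    SimpleNamespace(name="medium", characters=40_000, documents=300),
]
for line in summarize(fake, top=2):
    print(line)
```

With a real `Speakleash` object, the call would presumably look like `summarize(sl.datasets, top=10)`.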
You can read individual properties (e.g. characters, documents), or display the entire manifest:
sl = Speakleash(replicate_to)
print(sl.get("plwiki").manifest)
If you select one of them with .get(name of dataset), its data property gives you the documents as text, and there is a lot of it ;-)
from speakleash import Speakleash
import os
base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")
sl = Speakleash(replicate_to)
wiki = sl.get("plwiki").data
for doc in wiki:
    print(doc[:40])  # first 40 characters of each document
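Because a full dataset can be very large, you may want to look at only the first few documents. A minimal sketch of that idea follows; `preview` is a hypothetical helper, not a library function, and it assumes only that the `data` property can be iterated to yield strings, as in the loop above. The sample generator stands in for a real dataset.

```python
from itertools import islice


def preview(docs, limit=3, width=40):
    """Return the first `width` characters of the first `limit` documents."""
    # islice stops after `limit` items, so the rest of the iterable is never read.
    return [doc[:width] for doc in islice(docs, limit)]


# Stand-in generator; with the library installed you would pass
# sl.get("plwiki").data instead.
sample = (f"Dokument numer {i} " * 5 for i in range(1000))
print(preview(sample, limit=2, width=16))
```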
If you also need metadata, use the ext_data property:
ds = sl.get("plwiki").ext_data
for doc in ds:
    # Each item is a (text, metadata) pair.
    txt, meta = doc
    print(meta.get("title"))
    print(txt)
Commonly used metadata fields:
- title
- length
- sentences
- words
- verbs
- nouns
- symbols
- punctuations
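These fields can be used to select documents before processing them. The sketch below is illustrative only: `filter_by_words` is a hypothetical helper, and it assumes that each item is a (text, metadata) pair, that the metadata behaves like a dictionary, and that the `words` field holds a numeric word count. None of this is confirmed for every dataset, so check the metadata of the dataset you use.

```python
def filter_by_words(pairs, min_words=100):
    """Yield (text, meta) pairs whose meta["words"] is at least `min_words`."""
    for txt, meta in pairs:
        # Documents without a "words" entry are treated as having 0 words.
        if meta.get("words", 0) >= min_words:
            yield txt, meta


# Made-up sample pairs; with the library installed you would pass
# sl.get("plwiki").ext_data instead.
sample = [
    ("Krótki tekst.", {"title": "A", "words": 2}),
    ("Dłuższy tekst " * 60, {"title": "B", "words": 120}),
    ("Bez licznika słów.", {"title": "C"}),
]
titles = [meta.get("title") for _, meta in filter_by_words(sample, min_words=100)]
print(titles)
```

Writing the filter as a generator keeps it lazy, so documents are checked one at a time rather than collected in memory first.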
Supported languages
On June 9, 2023, Croatia joined our projects. To use Croatian language datasets, pass the lang parameter when creating the Speakleash object:
from speakleash import Speakleash
import os

# Directory where SpeakLeash stores the dataset files (next to this script).
base_dir = os.path.dirname(os.path.abspath(__file__))
replicate_to = os.path.join(base_dir, "datasets")

# The second argument selects the language; "hr" is Croatian.
sl = Speakleash(replicate_to, "hr")

# Print a one-line summary per dataset; no document text is iterated here.
for d in sl.datasets:
    size_mb = round(d.characters / 1024 / 1024)
    print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))
Hashes for speakleash-0.3.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0033a6d056cd8fc64e1ab68ac06968990f84044fd8549a8e036caa16b62ee49c
MD5 | 68769707b20d71a7b7136d3392beba9c
BLAKE2b-256 | 9406730ee1100fe84662e6fd934fc3f49f497df008f8a142d2901e376465470b