speakleash

SpeakLeash agnostic dataset for Polish

Project description

SpeakLeash is a lightweight library providing datasets for the Polish language and tools to make them useful.

Website: https://speakleash.org/
Datasets: https://speakleash.org/dashboard/
Source code: https://github.com/speakleash/speakleash
Data in action: https://github.com/speakleash/speakleash-examples
Bug reports: https://github.com/speakleash/speakleash/issues

Installation

The speakleash package is available on PyPI and should be installed in a virtual environment:

pip install speakleash
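Since the package is meant to live in a virtual environment, a quick check before installing can help. The following is a minimal sketch that relies only on the standard library; the helper name in_virtual_environment is hypothetical, and the check covers venv-style environments only (for example, a conda environment would not be detected).

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if not in_virtual_environment():
    print("No virtual environment detected; consider creating one before installing speakleash.")
```

If the message appears, creating an environment with python -m venv and activating it before running pip is the usual route.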

Basic Usage

If you just want to see the details of the datasets:

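from speakleash import Speakleash
import os

# Dataset files are replicated into a "datasets" directory next to this script
base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

# Print one summary line per dataset; there is no need to read d.data for this
for d in sl.datasets:
    # Character count divided by 1024 * 1024, used as a rough size estimate
    size_mb = round(d.characters/1024/1024)
    print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))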
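The same properties can be used to order the summary, for example to list the largest datasets first. This is only a sketch: describe and largest_first are hypothetical helpers, and the sample objects are stand-ins with made-up numbers so the snippet runs without downloading anything. It assumes each dataset exposes name, characters and documents as in the example above.

```python
from types import SimpleNamespace


def describe(dataset):
    """Format one summary line from a dataset's name, characters and documents."""
    size_mb = round(dataset.characters / 1024 / 1024)
    return "Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(
        dataset.name, size_mb, dataset.characters, dataset.documents)


def largest_first(datasets):
    """Return the datasets sorted by character count, largest first."""
    return sorted(datasets, key=lambda d: d.characters, reverse=True)


# Stand-in objects with made-up numbers; real code would pass sl.datasets instead.
sample = [
    SimpleNamespace(name="small_demo", characters=2 * 1024 * 1024, documents=10),
    SimpleNamespace(name="big_demo", characters=50 * 1024 * 1024, documents=400),
]

for d in largest_first(sample):
    print(describe(d))
```

With a real Speakleash object, largest_first(sl.datasets) should work the same way as long as sl.datasets is iterable.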

You can use individual properties (e.g. characters, documents), or you can display the entire manifest:

sl = Speakleash(replicate_to)
print(sl.get("plwiki").manifest)
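A raw manifest can be hard to read when printed on one line. The sketch below pretty-prints it, assuming the manifest behaves like a plain dictionary of JSON-compatible values (an assumption, not something stated above). The helper format_manifest and the example dictionary are hypothetical.

```python
import json


def format_manifest(manifest):
    """Render a manifest as indented JSON, keeping Polish characters readable."""
    # ensure_ascii=False keeps non-ASCII characters as they are;
    # default=str is a fallback for values that are not JSON-serializable.
    return json.dumps(manifest, indent=2, ensure_ascii=False,
                      sort_keys=True, default=str)


# Hypothetical manifest with made-up fields; real code would pass
# sl.get("plwiki").manifest instead.
example = {"name": "demo", "description": "Przykładowy zbiór", "stats": {"documents": 3}}
print(format_manifest(example))
```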

If you choose one of the datasets with .get(<dataset name>), its data property gives you a lot of text data ;-)

from speakleash import Speakleash
import os

base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")

sl = Speakleash(replicate_to)

wiki = sl.get("plwiki").data
for doc in wiki:
    print(doc[:40])
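Because a dataset can hold a very large number of documents, it is often enough to look at the first few. A minimal sketch, assuming data can be iterated document by document and that each document is a string, as in the loop above. The helper preview is hypothetical and the sample texts are made up.

```python
from itertools import islice


def preview(docs, limit=3, width=40):
    """Return the first `width` characters of the first `limit` documents."""
    # islice stops after `limit` items, so the rest of the iterable is never read.
    return [doc[:width] for doc in islice(docs, limit)]


# Made-up sample texts; real code would pass sl.get("plwiki").data instead.
sample_docs = ["Pierwszy dokument.", "Drugi dokument.", "Trzeci dokument.", "Czwarty dokument."]

for snippet in preview(sample_docs, limit=2, width=8):
    print(snippet)
```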

If you also need metadata, use the ext_data property:

ds = sl.get("plwiki").ext_data
for doc in ds:
    print(doc)
    txt, meta = doc
    print(meta.get("title"))
    print(txt)

Popular metadata fields:

title
length
sentences
words
verbs
nouns
symbols
punctuations
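The metadata fields listed above can be used to select documents. The sketch below keeps only documents with at least a given number of words. It assumes ext_data yields (text, metadata) pairs as in the example above and that the words field holds a number. The helper filter_by_words is hypothetical and the sample pairs are made up.

```python
def filter_by_words(pairs, min_words):
    """Yield (text, meta) pairs whose 'words' metadata is at least min_words."""
    for txt, meta in pairs:
        # A missing or empty 'words' value is treated as 0.
        if (meta.get("words") or 0) >= min_words:
            yield txt, meta


# Made-up sample pairs; real code would pass sl.get("plwiki").ext_data instead.
sample_pairs = [
    ("Krótki tekst.", {"title": "A", "words": 2}),
    ("To jest nieco dłuższy tekst przykładowy.", {"title": "B", "words": 6}),
    ("Brak statystyk.", {"title": "C"}),
]

for txt, meta in filter_by_words(sample_pairs, min_words=5):
    print(meta.get("title"), len(txt))
```

The same pattern should work for the other numeric fields, such as sentences or nouns, by changing the key.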

Supported languages

On June 9, 2023, Croatia joined our project. If you want to use Croatian-language datasets, just add the lang parameter ("hr") when creating the Speakleash object.

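from speakleash import Speakleash
import os

# Dataset files are replicated into a "datasets" directory next to this script
base_dir = os.path.join(os.path.dirname(__file__))
replicate_to = os.path.join(base_dir, "datasets")

# The second argument selects the language; "hr" is Croatian
sl = Speakleash(replicate_to, "hr")

# Print one summary line per dataset; there is no need to read d.data for this
for d in sl.datasets:
    # Character count divided by 1024 * 1024, used as a rough size estimate
    size_mb = round(d.characters/1024/1024)
    print("Dataset: {0}, size: {1} MB, characters: {2}, documents: {3}".format(d.name, size_mb, d.characters, d.documents))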

Release history

Version	Release date
0.3.4 (this version)	Oct 15, 2023
0.3.3	Sep 13, 2023
0.2.3	Jul 28, 2023
0.2.1	Jul 4, 2023
0.2.0	Jul 4, 2023
0.1.1	Jun 9, 2023
0.1.0	Jun 9, 2023
0.0.14	May 7, 2023
0.0.13	Feb 9, 2023
0.0.12	Feb 4, 2023
0.0.11	Jan 25, 2023
0.0.10	Jan 21, 2023
0.0.9	Dec 12, 2022
0.0.8	Dec 11, 2022
0.0.7	Dec 10, 2022
0.0.6	Dec 10, 2022
0.0.5	Dec 10, 2022
0.0.4	Dec 10, 2022
0.0.3	Dec 9, 2022
0.0.2	Dec 9, 2022
0.0.1	Dec 9, 2022

Download files

Source Distribution

speakleash-0.3.4.tar.gz (12.4 kB)

Uploaded Oct 15, 2023 Source

Built Distribution

speakleash-0.3.4-py3-none-any.whl (14.3 kB)

Uploaded Oct 15, 2023 Python 3

Hashes for speakleash-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`e194668204376fbe50317d2c292b6b64f85cfe61e5a728999f16f5013c9c05a0`
MD5	`8e5abd4fe7b8ab8347d47994aaf52062`
BLAKE2b-256	`f8ede8b335ed8c51145b1d006f6ce1e227e5278c1c1c81f5d774f4c1e9a84f1f`

Hashes for speakleash-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0033a6d056cd8fc64e1ab68ac06968990f84044fd8549a8e036caa16b62ee49c`
MD5	`68769707b20d71a7b7136d3392beba9c`
BLAKE2b-256	`9406730ee1100fe84662e6fd934fc3f49f497df008f8a142d2901e376465470b`