Extracts data from German Wiktionary dump files.

License
- OSI Approved :: MIT License
Natural Language
- German
Operating System
- OS Independent
Programming Language
Topic
- Text Processing :: Markup :: XML

Project description

wiktionary-de-parser

A Python module to extract data from German Wiktionary XML files (for Python 3.11+).

Features

Extracts IPA transcriptions, hyphenation, language, basic part-of-speech information, grammatical gender (Genus), and inflection (Flexion) tables of a word.
Yields results per entry, not per page: a page can contain multiple entries because a word can have several meanings.

Installation

pip install wiktionary-de-parser

Or with Poetry:

poetry add wiktionary-de-parser

Usage

Loading the XML dump file

from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump

# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")

# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()

# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()

# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()

Parsing the dump file

from pprint import pprint
from wiktionary_de_parser import WiktionaryParser

# ... (see above)

parser = WiktionaryParser()

for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        break
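
When several pages are needed, the loop above can be wrapped in a small generator. The following is a minimal sketch and not part of the package: `iter_parsed_entries` and its `wanted` parameter are hypothetical names. It relies only on the calls shown above (`dump.pages()`, `page.redirect_to`, `page.name`, `parser.entries_from_page()`, `parser.parse_entry()`) and assumes that each page name occurs at most once in the dump.

```python
from typing import Any, Iterable, Iterator


def iter_parsed_entries(dump: Any, parser: Any, wanted: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed result per entry for every page whose name is in `wanted`.

    Hypothetical helper: `dump` and `parser` are expected to behave like the
    WiktionaryDump and WiktionaryParser objects used in the loop above.
    """
    remaining = set(wanted)
    for page in dump.pages():
        if not remaining:
            break  # every requested page has been handled; stop scanning
        if page.redirect_to:
            continue  # skip redirects, as in the loop above
        if page.name not in remaining:
            continue
        # Assumption: a page name appears only once, so it can be ticked off.
        remaining.discard(page.name)
        for entry in parser.entries_from_page(page):
            yield parser.parse_entry(entry)
```

With the objects created above, `iter_parsed_entries(dump, parser, ["Abend"])` should yield the same results that the loop prints.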

Output

All page entries for "Abend":

ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", inflected=False),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)
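
The three results differ mainly in `pos` and `flexion`, so these two fields are a natural way to tell the common noun apart from the surname and the toponym. The sketch below illustrates this under stated assumptions: `EntryStandIn`, `is_common_noun` and `noun_summary` are hypothetical names, and the stand-in class only mirrors a few of the fields shown in the output so that the example runs without a dump file. Real results from `parser.parse_entry()` would be passed to the two functions instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryStandIn:
    """Hypothetical stand-in mirroring a few fields of the output shown above."""

    name: str
    flexion: Optional[dict[str, str]]
    pos: dict[str, list[str]]


def is_common_noun(entry) -> bool:
    """True for a "Substantiv" entry without sub-tags such as "Nachname" or "Toponym"."""
    return "Substantiv" in entry.pos and not entry.pos["Substantiv"]


def noun_summary(entry) -> Optional[dict[str, Optional[str]]]:
    """Pick the gender and the nominative forms out of the flexion table, if present."""
    if not entry.flexion:
        return None
    return {
        "word": entry.name,
        "genus": entry.flexion.get("Genus"),
        "singular": entry.flexion.get("Nominativ Singular"),
        "plural": entry.flexion.get("Nominativ Plural"),
    }


# Values copied from the "Abend" output above (flexion table shortened).
entries = [
    EntryStandIn(
        name="Abend",
        flexion={"Genus": "m", "Nominativ Singular": "Abend", "Nominativ Plural": "Abende"},
        pos={"Substantiv": []},
    ),
    EntryStandIn(name="Abend", flexion=None, pos={"Substantiv": ["Nachname"]}),
    EntryStandIn(name="Abend", flexion=None, pos={"Substantiv": ["Toponym"]}),
]

summaries = [noun_summary(entry) for entry in entries if is_common_noun(entry)]
print(summaries)
```

Only the first entry passes the filter, because the other two carry a sub-tag in `pos` and have no flexion table.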

Development

This project uses Poetry.

Install Poetry.
Clone this repository.
Run poetry install inside the project folder to install the dependencies.
Use notebook.ipynb to test the parser.
Run poetry run pytest to run the tests.

License

MIT © Gregor Weichbrodt

Release history

Version	Release date
0.11.5 (this version)	Feb 10, 2024
0.11.4	Feb 10, 2024
0.11.3	Feb 10, 2024
0.11.2	Feb 9, 2024
0.11.1	Feb 5, 2024
0.11.0	Feb 4, 2024
0.10.1	Jan 29, 2024
0.10.0	Jan 29, 2024
0.9.5	Jul 26, 2022
0.9.4	Jul 18, 2022
0.9.3	Jul 18, 2022
0.9.2	Jul 17, 2022
0.9.1	Jul 15, 2022
0.9.0	Jul 15, 2022
0.8.9	Nov 13, 2021
0.8.8	Nov 12, 2021
0.8.7	Nov 12, 2021
0.8.6	Nov 12, 2021
0.8.5	Nov 12, 2021
0.8.4	Nov 12, 2021
0.8.3	Nov 12, 2021
0.8.2	Nov 10, 2021
0.8.1	Jul 9, 2020
0.8.0	Dec 1, 2019
0.7.9	Dec 1, 2019
0.7.8	Dec 1, 2019
0.7.7	Jul 16, 2019
0.7.6	Jul 13, 2019
0.7.5	Jul 13, 2019
0.7.4	Jul 13, 2019
0.7.3	May 29, 2019
0.7.2	May 29, 2019
0.7.1	May 27, 2019

Download files

Source Distribution

wiktionary_de_parser-0.11.5.tar.gz (16.0 kB), uploaded Feb 10, 2024 (source)

Built Distribution

wiktionary_de_parser-0.11.5-py3-none-any.whl (20.8 kB), uploaded Feb 10, 2024 (Python 3)

Hashes for wiktionary_de_parser-0.11.5.tar.gz

Algorithm	Hash digest
SHA256	`bbc8c91e302e74a6ef5329952dd16c5df388fe36f23787ccdd5fba94799b3da5`
MD5	`785e0e97a700a8f5c6e4d88834badc58`
BLAKE2b-256	`f2185cbcb0ec3854f178f4333c0093e508c48aafb7a97b9890ee46522e2ccd2d`

Hashes for wiktionary_de_parser-0.11.5-py3-none-any.whl

Algorithm	Hash digest
SHA256	`f39505af6c7b6b2c321abc2ad772faed4922286e9a3a06764c3aa1ce64cc8f98`
MD5	`e3183e9b8f0ce124a47738e30df0fe4a`
BLAKE2b-256	`6b871c372cd25eeeeb37fe45c84c8add1ef67b4ed2a09cebe96dc18b24b37db3`
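
A manually downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only the Python standard library. The function names `sha256_of_file` and `matches_published_digest` are hypothetical, and the file is assumed to be available at a local path of your choosing.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest(Path("wiktionary_de_parser-0.11.5.tar.gz"), "bbc8c91e302e74a6ef5329952dd16c5df388fe36f23787ccdd5fba94799b3da5")` should return `True` for an unmodified copy of the source distribution, assuming the file sits in the current directory.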