
HTML Cleaver 🍀🦫

A tool for parsing HTML into a chain of text chunks, each carrying its relevant headers.

The API entry point is src/html_cleaver/cleaver.
The core algorithm and data structures are in src/html_cleaver/handler.

This is a "tree-capitator" if you will,
cleaving headers together while cleaving text apart.
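
Conceptually, the idea looks something like the sketch below. This is an illustration only, not the library's implementation (that lives in src/html_cleaver/handler, and the real chunk objects differ): it uses Python's standard html.parser to track the most recent h1-h6 headers and pairs each block of text with that header chain.

# Illustration only: a minimal sketch of the chunking idea, not html_cleaver itself.
from html.parser import HTMLParser

class HeaderChainSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headers = {}       # header level -> header text, e.g. {1: "Title", 2: "Section"}
        self.current_tag = None
        self.chunks = []        # (header chain, text) pairs

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            level = int(self.current_tag[1])
            # a new header replaces itself and any deeper headers ("cleaving headers together")
            self.headers = {k: v for k, v in self.headers.items() if k < level}
            self.headers[level] = text
        else:
            # each text block is emitted with its current header chain ("cleaving text apart")
            chain = [self.headers[k] for k in sorted(self.headers)]
            self.chunks.append((chain, text))

parser = HeaderChainSketch()
parser.feed("<h1>A</h1><h2>B</h2><p>under A/B</p><h2>C</h2><p>under A/C</p>")
for chain, text in parser.chunks:
    print(chain, text)
# ['A', 'B'] under A/B
# ['A', 'C'] under A/C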

Quickstart:

pip install html-cleaver

Optionally, if you're working with HTML that requires JavaScript rendering:
pip install selenium

Try an example on the command line:
python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/

Example usage:

Cleaving pages of varying difficulty:

from html_cleaver.cleaver import get_cleaver

# default parser is "lxml" for loose html
with get_cleaver() as cleaver:
    
    # handle chunk-events directly
    # (example of favorable structure yielding high-quality chunks)
    cleaver.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)
    
    # get collection of chunks
    # (example of moderate structure yielding medium-quality chunks)
    for c in cleaver.parse_chunk_sequence(
            ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"]):
        print(c)
    
    # sequence of chunks from a sequence of pages
    # (examples of challenging structure yielding poor-quality chunks)
    urls = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in cleaver.parse_chunk_sequence(urls):
        print(c)

# example of mitigating challenging structure by focusing on specific headers
with get_cleaver("lxml", ["h4", "h5"]) as cleaver:
    cleaver.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)
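
Restricting the cleaver to specific header levels (here h4 and h5) trades a shorter header chain for cleaner chunk boundaries on pages whose heading hierarchy is noisy.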

Example usage with Selenium:

Using Selenium on a page that requires JavaScript to load its contents:

from html_cleaver.cleaver import get_cleaver

print("using default lxml produces very few chunks:")
with get_cleaver() as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)

print("using selenium produces many more chunks:")
with get_cleaver("selenium") as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)

Development:

Testing:

Testing without Poetry:
pip install lxml
pip install selenium
python -m unittest discover -s src

Testing with Poetry:
poetry install
poetry run pytest

Build:

Building from source:
rm dist/*
python -m build

Installing from the build:
pip install dist/*.whl
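
As a quick sanity check after installing, the command-line example from the Quickstart should print chunks:
python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/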

Publishing from the build:
python -m twine upload --skip-existing -u __token__ -p $TESTPYPI_TOKEN --repository testpypi dist/*
python -m twine upload --skip-existing -u __token__ -p $PYPI_TOKEN dist/*
