
HTML Cleaver 🍀🦫

A tool for parsing HTML into a chain of text chunks, each carrying its relevant headers.

The API entry point is src/html_cleaver/cleaver.
The core algorithm and data structures are in src/html_cleaver/handler.

This is a "tree-capitator" if you will,
cleaving headers together while cleaving text apart.
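
Conceptually, the idea looks something like the sketch below. This is an illustration only, not the library's implementation (that lives in src/html_cleaver/handler, and the real chunk objects differ): it uses Python's standard html.parser to track the most recent h1-h6 headers and pairs each block of text with that header chain.

# Illustration only: a minimal sketch of the chunking idea, not html_cleaver itself.
from html.parser import HTMLParser

class HeaderChainSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headers = {}       # header level -> header text, e.g. {1: "Title", 2: "Section"}
        self.current_tag = None
        self.chunks = []        # (header chain, text) pairs

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            level = int(self.current_tag[1])
            # a new header replaces itself and any deeper headers ("cleaving headers together")
            self.headers = {k: v for k, v in self.headers.items() if k < level}
            self.headers[level] = text
        else:
            # each text block is emitted with its current header chain ("cleaving text apart")
            chain = [self.headers[k] for k in sorted(self.headers)]
            self.chunks.append((chain, text))

parser = HeaderChainSketch()
parser.feed("<h1>A</h1><h2>B</h2><p>under A/B</p><h2>C</h2><p>under A/C</p>")
for chain, text in parser.chunks:
    print(chain, text)
# ['A', 'B'] under A/B
# ['A', 'C'] under A/C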

Quickstart:

pip install html-cleaver

Optionally, if you're working with HTML that requires JavaScript rendering:
pip install selenium

Try an example on the command line:
python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/

Example usage:

Cleaving pages of varying difficulty:

from html_cleaver.cleaver import get_cleaver

# default parser is "lxml" for loose html
with get_cleaver() as cleaver:
    
    # handle chunk-events directly
    # (example of favorable structure yielding high-quality chunks)
    cleaver.parse_events(
        ["https://plato.stanford.edu/entries/goedel/"],
        print)
    
    # get collection of chunks
    # (example of moderate structure yielding medium-quality chunks)
    for c in cleaver.parse_chunk_sequence(
            ["https://en.wikipedia.org/wiki/Kurt_G%C3%B6del"]):
        print(c)
    
    # sequence of chunks from a sequence of pages
    # (examples of challenging structure yielding poor-quality chunks)
    urls = [
        "https://www.gutenberg.org/cache/epub/56852/pg56852-images.html",
        "https://www.cnn.com/2023/09/25/opinions/opinion-vincent-doumeizel-seaweed-scn-climate-c2e-spc-intl"]
    for c in cleaver.parse_chunk_sequence(urls):
        print(c)

# example of mitigating challenging structure by focusing on specific headers
with get_cleaver("lxml", ["h4", "h5"]) as cleaver:
    cleaver.parse_events(
        ["https://www.gutenberg.org/cache/epub/56852/pg56852-images.html"],
        print)
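
Restricting the cleaver to specific header levels (here h4 and h5) trades a shorter header chain for cleaner chunk boundaries on pages whose heading hierarchy is noisy.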

Example usage with Selenium:

Using Selenium on a page that requires JavaScript to load its contents:

from html_cleaver.cleaver import get_cleaver

print("using default lxml produces very few chunks:")
with get_cleaver() as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)

print("using selenium produces many more chunks:")
with get_cleaver("selenium") as cleaver:
    cleaver.parse_events(
        ["https://www.youtube.com/watch?v=rfscVS0vtbw"],
        print)

Development:

Testing:

Testing without Poetry:
pip install lxml
pip install selenium
python -m unittest discover -s src

Testing with Poetry:
poetry install
poetry run pytest

Build:

Building from source:
rm dist/*
python -m build

Installing from the build:
pip install dist/*.whl
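
As a quick sanity check after installing, the command-line example from the Quickstart should print chunks:
python -m html_cleaver.cleaver https://plato.stanford.edu/entries/goedel/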

Publishing from the build:
python -m twine upload --skip-existing -u __token__ -p $TESTPYPI_TOKEN --repository testpypi dist/*
python -m twine upload --skip-existing -u __token__ -p $PYPI_TOKEN dist/*
