LangChain's Document model, implemented in Rust using pyo3 and maturin

RS Document

An opinionated Rust implementation of various common functions of LangChain's Document model as well as Unstructured.io's post processors.

Why?

I've been tinkering with different RAG projects and have landed on a set of processes that are common to most of them. I was also looking for a reason to try out the excellent maturin build system, which allows Rust code to be shipped as a Python package. Since my common processes involve a lot of text processing, usually over a large number of documents, this seemed like a great project to get going with it.

Installation

pip install rs_document

Usage

The main function of this package is to quickly clean and split many documents.

from rs_document import Document, clean_and_split_docs


# Create a document with known attributes
content = "A" * 4000
data = {"Hello": "World"}
doc = Document(page_content=content, metadata=data)


# Run all cleaners on the document
doc.clean()

# Recursively split document
doc.recursive_character_splitter(1000) # -> Produces list of documents

Cleaners

The cleaners that are reimplemented from Unstructured.io are:

  • clean_non_ascii_chars
  • clean_bullets
  • clean_ligatures
  • clean_extra_whitespace
  • group_broken_paragraphs
  • new_line_grouper
  • auto_paragraph_grouper

Instead of being standalone functions, I implemented them as methods on the Document class.

There is also a .clean() method, which will run all of the cleaners.

The test_cleaners.py module shows how they can be used.

from rs_document import Document


def test_non_ascii_characters_cleanup() -> None:
    doc = Document(
        page_content="\x88This text contains non-ascii characters!\x88",
        metadata={},
    )
    assert "\x88" in doc.page_content
    doc.clean_non_ascii_chars()
    assert (
        str(doc)
        == 'Document(page_content="This text contains non-ascii characters!", metadata={})'
    )
    assert "\x88" not in doc.page_content


def test_bullet_characters_cleanup() -> None:
    doc = Document(page_content="●  This is an excellent point!", metadata={})
    assert "●" in doc.page_content
    doc.clean_bullets()
    assert (
        str(doc) == 'Document(page_content="This is an excellent point!", metadata={})'
    )
    assert "●" not in doc.page_content


def test_ligature_cleanup() -> None:
    doc = Document(page_content="æ This is an excellent point!", metadata={})
    assert "æ" in doc.page_content
    doc.clean_ligatures()
    assert (
        str(doc)
        == 'Document(page_content="ae This is an excellent point!", metadata={})'
    )
    assert "æ" not in doc.page_content


def test_extra_whitespace_cleanup() -> None:
    doc = Document(page_content="ITEM 1.     BUSINESS ", metadata={})
    doc.clean_extra_whitespace()
    assert str(doc) == 'Document(page_content="ITEM 1. BUSINESS", metadata={})'
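
To give a rough idea of what the four cleaners exercised above do, here is a minimal pure-Python sketch. It is not the Rust implementation and not Unstructured.io's code: the function bodies, the set of bullet characters, and the small ligature table are assumptions chosen only so that the inputs from the tests above produce the same outputs.

```python
import re

# Illustrative subsets only; the real cleaners may cover more characters.
BULLETS = "●•‣◦⁃∙"
LIGATURES = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ﬁ": "fi", "ﬂ": "fl"}


def clean_non_ascii_chars(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


def clean_bullets(text: str) -> str:
    """Remove one leading bullet character and the whitespace around it."""
    return re.sub(rf"^\s*[{BULLETS}]\s*", "", text)


def clean_ligatures(text: str) -> str:
    """Replace each ligature with its plain-letter expansion."""
    for ligature, expansion in LIGATURES.items():
        text = text.replace(ligature, expansion)
    return text


def clean_extra_whitespace(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


print(clean_extra_whitespace("ITEM 1.     BUSINESS "))  # ITEM 1. BUSINESS
```

Unlike the Document methods, which update page_content in place, these stand-ins take and return plain strings.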

Splitters

There are two splitters:

  • split_on_num_characters
  • recursive_character_splitter

Similarly, they are implemented as methods on the Document class.

def test_splitting(document_fixture: Document) -> None:
    split = document_fixture.split_on_num_characters(5)
    assert len(split) == len(document_fixture.page_content) / 5
    assert split[0].metadata == {"Hello": "World"}
    assert split[0].page_content == "AAAAA"
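
The behaviour this test checks can be sketched in plain Python. The Doc dataclass below is a hypothetical stand-in for the package's Document, and keeping a shorter final chunk when the length is not a multiple of the chunk size is an assumption, not something the test above confirms.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for rs_document.Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_on_num_characters(doc: Doc, num_chars: int) -> list[Doc]:
    """Cut the text into consecutive chunks of num_chars characters.

    Each chunk gets its own copy of the parent's metadata.
    """
    text = doc.page_content
    return [
        Doc(text[start:start + num_chars], dict(doc.metadata))
        for start in range(0, len(text), num_chars)
    ]


chunks = split_on_num_characters(Doc("A" * 4000, {"Hello": "World"}), 5)
print(len(chunks), chunks[0].page_content)  # 800 AAAAA
```

Note that the equality in the test above only holds when the content length is an exact multiple of the chunk size, as it is for a 4000-character fixture split into chunks of 5.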

A Note about recursive_character_splitter

The recursive character splitter is modeled after LangChain's recursive character splitter, but is absolutely not a 1:1 implementation. You will note that it doesn't take in a chunk overlap. This is because I've implemented it to have an effective overlap of about 1/3 of the chunk size, as this is the number I've found most useful in my tinkering. It also doesn't allow passing in separators, because the default separators seem to be the best in every situation I've encountered. This makes the interface as simple as passing in a chunk_size.
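
One way to get an overlap of roughly 1/3 of the chunk size is sketched below in plain Python. This is an assumption about how such a splitter could work, not a description of the Rust code: the text is first cut recursively on LangChain's default separators into pieces of at most chunk_size // 3 characters, and then every chunk is built from three consecutive pieces, advancing two pieces at a time so that neighbouring chunks share one piece. The helper name split_small is hypothetical.

```python
# LangChain's default separators, tried from coarsest to finest.
SEPARATORS = ("\n\n", "\n", " ", "")


def split_small(text: str, max_len: int, seps: tuple = SEPARATORS) -> list[str]:
    """Recursively cut text into pieces of at most max_len characters.

    Separators stay attached to the piece before them, so joining the
    pieces gives back the original text.
    """
    if len(text) <= max_len:
        return [text] if text else []
    sep, finer = seps[0], seps[1:]
    if sep == "":
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    parts = text.split(sep)
    segments = [part + sep for part in parts[:-1]]
    if parts[-1]:
        segments.append(parts[-1])
    pieces, current = [], ""
    for segment in segments:
        if len(current) + len(segment) <= max_len:
            current += segment  # still fits: keep packing
            continue
        if current:
            pieces.append(current)
            current = ""
        if len(segment) <= max_len:
            current = segment
        else:
            # Too long on its own: retry with the finer separators.
            pieces.extend(split_small(segment, max_len, finer))
    if current:
        pieces.append(current)
    return pieces


def recursive_character_splitter(text: str, chunk_size: int) -> list[str]:
    """Build chunks from three pieces each, sharing one piece with the next."""
    pieces = split_small(text, max(1, chunk_size // 3))
    if not pieces:
        return []
    last_start = max(len(pieces) - 1, 1)
    return ["".join(pieces[i:i + 3]) for i in range(0, last_start, 2)]


chunks = recursive_character_splitter("A" * 4000, 1000)
print(len(chunks), max(len(c) for c in chunks))  # 6 999
```

Because every chunk is made of at most three pieces of at most chunk_size // 3 characters, no chunk in this sketch exceeds chunk_size.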

clean_and_split_docs function

To keep interfacing with this module as quick and easy as possible in most of my projects, I've also implemented a wrapper function, clean_and_split_docs, which takes in a list of documents (like what you'd get from a document loader) and a chunk_size, and gives back a list of cleaned and split documents.
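
The shape of such a wrapper can be sketched in plain Python: clean each document, split it, and flatten everything into one list. The Doc class and the clean and split helpers below are simplified, hypothetical stand-ins (whitespace cleanup and fixed-size chunks only), so this shows the data flow rather than the package's actual cleaning or splitting logic.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for rs_document.Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def clean(doc: Doc) -> Doc:
    """Stand-in cleaner: collapse whitespace only."""
    return Doc(" ".join(doc.page_content.split()), dict(doc.metadata))


def split(doc: Doc, chunk_size: int) -> list[Doc]:
    """Stand-in splitter: fixed-size chunks with no overlap."""
    text = doc.page_content
    return [
        Doc(text[i:i + chunk_size], dict(doc.metadata))
        for i in range(0, len(text), chunk_size)
    ]


def clean_and_split(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Clean every document, split it, and return one flat list of chunks."""
    return [chunk for doc in docs for chunk in split(clean(doc), chunk_size)]


docs = [Doc("a   b  c", {"source": "x"}), Doc("d" * 10, {"source": "y"})]
print([c.page_content for c in clean_and_split(docs, 4)])
# ['a b ', 'c', 'dddd', 'dddd', 'dd']
```

Each chunk carries a copy of its parent's metadata here, mirroring what the splitter test shows for split[0].metadata.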

Performance

I knew that Rust has a leg up on Python in text processing performance, but I wanted to be sure that the performance improvements would be worth taking on another dependency.

I did some very scientific and rigorous testing on my personal machine, and cleaning and splitting 1,000 to 1,000,000 documents is somewhere between 40x and 75x faster with the Rust implementation.

I added some tests to ensure that those performance gains won't be lost as this package improves.

The tests expect at least 25,000 documents to be processed per second, and the Rust version to be at least 25 times faster than the Python version.
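
A throughput check of this kind could be sketched as follows. The helper names and the best-of-three timing are assumptions, not the package's actual test code; only the two thresholds come from the text above.

```python
import time
from typing import Callable, Sequence

MIN_DOCS_PER_SECOND = 25_000  # threshold quoted above
MIN_SPEEDUP = 25              # threshold quoted above


def best_time(process: Callable[[Sequence], object], docs: Sequence, repeats: int = 3) -> float:
    """Return the fastest wall-clock time, in seconds, over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(docs)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-9)  # avoid dividing by zero on very fast runs


def docs_per_second(process: Callable[[Sequence], object], docs: Sequence) -> float:
    """Throughput of process over docs."""
    return len(docs) / best_time(process, docs)


def speedup(fast: Callable, slow: Callable, docs: Sequence) -> float:
    """How many times faster fast is than slow on the same input."""
    return best_time(slow, docs) / best_time(fast, docs)


# Example with a trivial pure-Python "cleaner" standing in for real work.
docs = ["a   b  c " * 50] * 1_000
rate = docs_per_second(lambda ds: [" ".join(d.split()) for d in ds], docs)
print(f"{rate:,.0f} docs/s")
```

A real test would pass the Rust-backed and pure-Python pipelines as fast and slow, then assert that docs_per_second is at least MIN_DOCS_PER_SECOND and speedup is at least MIN_SPEEDUP.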

