docx2txt2

Extract text from .docx and .odt files to strings in pure python.

My personal replacement for docx2txt.

It's intended to be very simple and provide some utilities to match the functionality of the original lib.

Usage

Install with your favorite package manager; anything that pulls from PyPI will work (pip, poetry, pdm, etc.).

pip install docx2txt2

Pass either a file path (a string or any PathLike object) or a binary IO stream.

import io
from pathlib import Path
import docx2txt2

# plain string paths
text = docx2txt2.extract_text("path/to/my.docx")
image_paths = docx2txt2.extract_images("path/to/my.docx", "path/to/images/out")

# pathlib.Path objects
docx_path = Path(__file__).parent / "my.docx"
image_out = Path(__file__).parent / "my" / "images"
# exist_ok=True keeps reruns from failing when the directory already exists
image_out.mkdir(parents=True, exist_ok=True)

text2 = docx2txt2.extract_text(docx_path)
image_paths2 = docx2txt2.extract_images(docx_path, image_out)

# in-memory byte streams
docx_bytes = b"..."  # placeholder: the raw bytes of a .docx file
bytes_io = io.BytesIO(docx_bytes)
text3 = docx2txt2.extract_text(bytes_io)
image_paths3 = docx2txt2.extract_images(bytes_io, "path/to/images/out")
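For readers curious what "pure python" extraction involves, here is a minimal sketch using only the standard library. It is not docx2txt2's implementation; the helper names `sketch_extract_text` and `make_minimal_docx` are hypothetical. It relies only on the fact that a .docx file is a zip archive whose main text lives in `word/document.xml` inside `w:t` elements, and it ignores headers, footers, images and .odt files entirely.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def sketch_extract_text(source):
    """Hypothetical helper: return the body text of a .docx as one string.

    `source` may be a file path or a binary file-like object, because
    zipfile.ZipFile accepts both.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each paragraph holds runs; the visible text sits in w:t elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    # Collapse every run of whitespace into a single space.
    return " ".join(" ".join(paragraphs).split())


def make_minimal_docx(paragraph_texts):
    """Hypothetical helper: build a tiny in-memory archive for the sketch.

    It contains only word/document.xml, which is enough for the function
    above but is not a file Word itself would open.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>"
        for text in paragraph_texts
    )
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


stream = make_minimal_docx(["Hello  world", "second\tparagraph"])
print(sketch_extract_text(stream))
```

Because the sketch reads through `zipfile.ZipFile`, the same function accepts a path or a `BytesIO` stream, which mirrors the calling styles shown above.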

Compatibility & Motivation

docx2txt2 returns a superset of the data returned by docx2txt, with some caveats (below), so every whitespace-separated token in the original's output also appears in the new output:

import docx2txt

import docx2txt2

orig_content = docx2txt.process("my/file.docx").split()
new_content = docx2txt2.process("my/file.docx").split()

assert all(orig in new_content for orig in orig_content)

This is checked by the test test_extract_data.test_docx2txt_compatability.

Compatibility & Caveats

Whitespace and styling are not preserved as in the original; page breaks, tabs and the like become plain spaces.
Headers and footers contain "PAGE" where a page number would be, unlike the original, which removed them.
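Because of the whitespace caveat, output from the two libraries is best compared after whitespace has been normalized. The sketch below shows that normalization on a made-up string; `collapse_whitespace` is a hypothetical helper, not part of either library.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (newlines, tabs, form feeds) with one space."""
    return " ".join(text.split())


# Made-up sample resembling layout-preserving output: blank lines, tabs and a
# form feed (\x0c) standing in for a page break.
sample = "Title\n\n\tFirst line\twith tab\x0cNext page"
print(collapse_whitespace(sample))  # Title First line with tab Next page
```

Applying the same helper to both outputs before comparing them removes differences that come only from layout.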

Motivations for the rewrite:

Speed: I have lots of Word docs to process and I saw some efficiency gains over the original lib.
Formatting: I didn't want to strip whitespace after every run, so the output is preformatted to contain only spaces.

Benchmarks

Basic benchmarking using pytest-benchmark with a simple test document, on my M1 MacBook and on GitHub Actions. From these tests, this lib appears to be just under 2x faster on average.

MacBook:

----------------------------------------------------------------------------------- benchmark: 2 tests ----------------------------------------------------------------------------------
Name (time in ms)               Min               Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_docx2txt2     1.1498 (1.0)      6.2305 (1.0)      1.1949 (1.0)      0.3096 (1.0)      1.1685 (1.0)      0.0142 (1.0)          3;74  836.9124 (1.0)         724           1
test_benchmark_docx2txt      2.1684 (1.89)     7.5298 (1.21)     2.2469 (1.88)     0.3941 (1.27)     2.2044 (1.89)     0.0231 (1.62)         2;41  445.0671 (0.53)        365           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

GitHub Actions, Python 3.12:

----------------------------------------------------------------------------------- benchmark: 2 tests -----------------------------------------------------------------------------------
Name (time in ms)               Min                Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_docx2txt2     1.5368 (1.0)       8.6408 (1.0)      1.6104 (1.0)      0.4961 (1.0)      1.5697 (1.0)      0.0349 (1.0)          3;11  620.9509 (1.0)         565           1
test_benchmark_docx2txt      3.0235 (1.97)     10.1797 (1.18)     3.1365 (1.95)     0.5956 (1.20)     3.0822 (1.96)     0.0356 (1.02)         2;10  318.8220 (0.51)        279           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Disclaimer: More thorough benchmarking could be conducted. This is a faster lib in general but I haven't tested edge cases.
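The "just under 2x" figure can be reproduced from the Mean column of the two tables. The short calculation below copies those four values; the dictionary layout and labels are only for illustration.

```python
# Mean run times in milliseconds, copied from the benchmark tables.
mean_ms = {
    "MacBook (M1)": {"docx2txt2": 1.1949, "docx2txt": 2.2469},
    "GitHub Actions": {"docx2txt2": 1.6104, "docx2txt": 3.1365},
}

# Speedup = time taken by the original library / time taken by this one.
speedups = {
    env: times["docx2txt"] / times["docx2txt2"]
    for env, times in mean_ms.items()
}

for env, ratio in speedups.items():
    print(f"{env}: {ratio:.2f}x")
# MacBook (M1): 1.88x
# GitHub Actions: 1.95x
```

These ratios match the parenthesized values pytest-benchmark prints next to the Mean column.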

Also see:

pptx2txt2 for pptx/odp conversion

Release history

1.0.4 (this version): Mar 13, 2024
1.0.3: Mar 13, 2024
1.0.2: Mar 13, 2024
0.1.0: Mar 13, 2024
