Extract text from .pptx and .odp files to strings in pure python.

pptx2txt2

My personal replacement for pptx2txt.

It's intended to be very simple and provide some utilities to extract content similar to the original lib.

Usage

Install with your favorite package manager. Anything that pulls from PyPI will work: pip, poetry, pdm, etc.

pip install pptx2txt2

Pass a file path (a string or any PathLike object) or a binary IO stream.

There are three functions:

  • extract_text_per_slide returns a dict[int, str] of each slide's content and notes
  • extract_text is a utility that joins all slide content into one string
  • extract_images copies the images to another directory

import io
from pathlib import Path
import pptx2txt2

# path
text = pptx2txt2.extract_text("path/to/my.pptx")
text_per_slide = pptx2txt2.extract_text_per_slide("path/to/my.pptx")
image_paths = pptx2txt2.extract_images("path/to/my.pptx", "path/to/images/out")

# actual Paths
pptx_path = Path(__file__).parent / "my.pptx"
image_out = Path(__file__).parent / "my" / "images"
image_out.mkdir(parents=True, exist_ok=True)  # don't fail if the directory already exists

text2 = pptx2txt2.extract_text(pptx_path)
text_per_slide2 = pptx2txt2.extract_text_per_slide(pptx_path)
image_paths2 = pptx2txt2.extract_images(pptx_path, image_out)

# bytestreams
pptx_bytes = b"..."
bytes_io = io.BytesIO(pptx_bytes)
text3 = pptx2txt2.extract_text(bytes_io)
text_per_slide3 = pptx2txt2.extract_text_per_slide(bytes_io)
image_paths3 = pptx2txt2.extract_images(bytes_io, "path/to/images/out")
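
Because extract_text_per_slide returns a dict[int, str], the result can be searched or filtered like any other mapping. Here is a minimal sketch that uses a hand-written stand-in dict instead of a real presentation. The helper name slides_containing and the sample text are hypothetical and are not part of pptx2txt2:

```python
from __future__ import annotations


def slides_containing(text_per_slide: dict[int, str], keyword: str) -> list[int]:
    """Return the keys of slides whose text contains keyword (case-insensitive)."""
    needle = keyword.casefold()
    return sorted(
        key for key, text in text_per_slide.items() if needle in text.casefold()
    )


# Hypothetical stand-in for pptx2txt2.extract_text_per_slide("path/to/my.pptx");
# the keys and text are invented for illustration.
sample = {
    1: "Quarterly results Revenue grew",
    2: "Roadmap Next steps",
    3: "Appendix revenue breakdown",
}

matches = slides_containing(sample, "revenue")
print(matches)
```

With a real file, the dict returned by the library would simply replace sample.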

Considerations

  • Doesn't preserve whitespace or styling the way the original did; new slides, tabs and the like are now just spaces.
  • Headers and footers contain "<#>" or "" where there would be a number, unlike the original, which removed them.
  • For .pptx files, the extracted text contains a UUID where each image was.
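
If the collapsed whitespace or the image UUIDs get in the way, one option is to post-process the per-slide dict yourself. This is a minimal sketch and it assumes the UUIDs appear in the canonical 8-4-4-4-12 hexadecimal form, which this README does not confirm. UUID_RE, clean_slide and join_slides are hypothetical helpers, not part of pptx2txt2:

```python
from __future__ import annotations

import re

# Assumption: image placeholders look like canonical UUIDs,
# e.g. 123e4567-e89b-12d3-a456-426614174000.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def clean_slide(text: str) -> str:
    """Drop UUID-like tokens, then collapse the whitespace left behind."""
    return " ".join(UUID_RE.sub(" ", text).split())


def join_slides(text_per_slide: dict[int, str], sep: str = "\n\n") -> str:
    """Join cleaned slides in key order, separated by sep instead of a space."""
    return sep.join(clean_slide(text_per_slide[key]) for key in sorted(text_per_slide))


# Hypothetical stand-in for the dict returned by extract_text_per_slide.
sample = {
    2: "Second slide",
    1: "Title 123e4567-e89b-12d3-a456-426614174000 caption",
}
print(join_slides(sample))
```

Joining the per-slide dict this way keeps a visible break between slides, which extract_text does not.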

Benchmarks

Basic benchmarking using pytest-benchmark with a simple test document, run on my M1 MacBook and on GitHub Actions.

MacBook:

------------------------------------------------ benchmark: 1 tests -----------------------------------------------
Name (time in ms)               Min     Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------
test_benchmark_pptx2txt2     2.4470  7.1815  2.5762  0.4344  2.4987  0.1050       2;7  388.1666     122           1
-------------------------------------------------------------------------------------------------------------------

GitHub Actions, Python 3.12:

------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)               Min      Max    Mean  StdDev  Median     IQR  Outliers       OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_benchmark_pptx2txt2     4.0548  11.4523  4.2387  0.8312  4.1343  0.0484      3;11  235.9197     217           1
--------------------------------------------------------------------------------------------------------------------
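
As a reading aid for the tables above: the OPS column is operations per second, which is the reciprocal of the mean time. The quick check below uses the Mean values copied from the two tables. Small differences from the reported OPS are expected because the tables round the mean to four decimals:

```python
def ops_from_mean_ms(mean_ms: float) -> float:
    """Convert a mean time in milliseconds to operations per second."""
    return 1000.0 / mean_ms


macbook_ops = ops_from_mean_ms(2.5762)  # table reports 388.1666
actions_ops = ops_from_mean_ms(4.2387)  # table reports 235.9197

print(f"{macbook_ops:.2f} {actions_ops:.2f}")
```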
