rapidocr-pdf

Tools for extracting PDF content, based on RapidOCR


Project description

RapidOCRPDF

Built on RapidOCR, this package quickly extracts text from PDF files, including scanned and encrypted PDFs.
Layout restoration is not supported for now.

1. Install the package from PyPI.

# based on rapidocr_onnxruntime (quotes keep shells such as zsh from expanding the brackets)
pip install "rapidocr_pdf[onnxruntime]"

# based on rapidocr_openvino
pip install "rapidocr_pdf[openvino]"

2. Usage

Run from a script.

from rapidocr_pdf import PDFExtracter

pdf_extracter = PDFExtracter()

pdf_path = 'tests/test_files/direct_and_image.pdf'
texts = pdf_extracter(pdf_path)
print(texts)
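
Section 3 lists `Union[str, Path, bytes]` as the accepted input. As a minimal sketch of batch use, the helper below reads every PDF in a directory and hands the raw bytes to an extractor. `extract_many` and its parameters are hypothetical names introduced here, not part of rapidocr_pdf; the extractor is passed in as a plain callable so the helper does not depend on how the package is installed.

```python
from pathlib import Path
from typing import Callable, Dict, List, Union

# Input types as documented in section 3 of this README.
PdfInput = Union[str, Path, bytes]
# One result entry: [page number, page content, score], all strings in the README example.
PageResult = List[str]


def extract_many(
    extracter: Callable[[PdfInput], List[PageResult]],
    pdf_dir: Union[str, Path],
) -> Dict[str, List[PageResult]]:
    """Run `extracter` on every *.pdf file in `pdf_dir` (hypothetical helper)."""
    results: Dict[str, List[PageResult]] = {}
    for pdf_file in sorted(Path(pdf_dir).glob("*.pdf")):
        # Pass the file content as bytes, one of the documented input types.
        results[pdf_file.name] = extracter(pdf_file.read_bytes())
    return results
```

With the package installed, this would presumably be called as `extract_many(PDFExtracter(), 'tests/test_files')`, returning a dict that maps each file name to its list of page results.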

Run from the command line.

$ rapidocr_pdf -h
usage: rapidocr_pdf [-h] [-path FILE_PATH]

options:
  -h, --help            show this help message and exit
  -path FILE_PATH, --file_path FILE_PATH
                        File path, PDF or images

$ rapidocr_pdf -path tests/test_files/direct_and_image.pdf

3. Output format.

Input: Union[str, Path, bytes]

Output: a list of [page number, page content, score] entries, for example:

[
    ['0', '达大学拉斯维加斯分校）的一次中文评测中获得最', '0.8969868'],
    ['1', 'ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network∗\nYuliang Liu‡†', '0.8969868'],
]
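
Every field in the example above is a string, so callers will usually want to convert the page number and score before using them. The following is a small sketch of that post-processing, assuming one entry per page as in the example; `parse_results` and its `min_score` threshold are hypothetical and not provided by the package.

```python
from typing import Dict, List, Tuple


def parse_results(
    raw: List[List[str]], min_score: float = 0.0
) -> Dict[int, Tuple[str, float]]:
    """Map page number -> (content, score), dropping pages below `min_score`."""
    parsed: Dict[int, Tuple[str, float]] = {}
    for page_num, content, score in raw:
        score_value = float(score)  # scores arrive as strings such as '0.8969868'
        if score_value >= min_score:
            parsed[int(page_num)] = (content, score_value)
    return parsed


# Sample data copied from the output example in this README.
sample = [
    ['0', '达大学拉斯维加斯分校）的一次中文评测中获得最', '0.8969868'],
    ['1', 'ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network∗\nYuliang Liu‡†', '0.8969868'],
]

pages = parse_results(sample)
for page_num in sorted(pages):
    text, score = pages[page_num]
    print(f"page {page_num} (score {score:.2f}): {text[:30]}")
```

Keying the result by integer page number makes it easy to look up a single page or to sort pages numerically rather than as strings.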


Release history

Version	Release date
0.1.0 (this version)	Apr 27, 2024
0.0.8	Dec 4, 2023
0.0.7	Nov 18, 2023
0.0.6	Aug 28, 2023
0.0.5	Aug 24, 2023
0.0.4	Aug 24, 2023
0.0.3	Jul 26, 2023
0.0.2	Apr 27, 2023
0.0.1	Apr 17, 2023

Download files


Source Distributions

No source distribution files are available for this release.

Built Distribution

rapidocr_pdf-0.1.0-py3-none-any.whl (9.2 kB)

Uploaded Apr 27, 2024, Python 3

Hashes for rapidocr_pdf-0.1.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`3c64fd950033e7ceb65fadffdb8f8c50e60b5667322302f677f460b73f6e751a`
MD5	`cee5048c1da8cab69493eb6adf91c2c6`
BLAKE2b-256	`c72d7e916675a28dd37dbdd6027260990e65459652acd000810df7e7f7cc517c`
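
A downloaded wheel can be checked against the SHA256 digest listed above. The snippet below is one way to do that with Python's standard `hashlib`; the function names and the local file path in the usage note are assumptions for illustration.

```python
import hashlib
from pathlib import Path
from typing import Union

# SHA256 digest listed above for rapidocr_pdf-0.1.0-py3-none-any.whl.
EXPECTED_SHA256 = "3c64fd950033e7ceb65fadffdb8f8c50e60b5667322302f677f460b73f6e751a"


def sha256_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Union[str, Path], expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected('rapidocr_pdf-0.1.0-py3-none-any.whl')` should return `True` for an intact copy of the wheel saved under that name in the current directory.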