datafog

Scan, redact, and manage PII in your documents before they get uploaded to a Retrieval Augmented Generation (RAG) system.

Open-source DevSecOps for Generative AI Systems.

Overview

What is DataFog?

DataFog is an open-source DevSecOps platform that lets you scan for and redact Personally Identifiable Information (PII) in your Generative AI applications.

Core Problem

How it works

Installation

DataFog can be installed via pip:

pip install datafog

Examples

v3.2.0 NEW

Getting Started

The DataFog library provides functionality for text and image processing, including PII (Personally Identifiable Information) annotation and OCR (Optical Character Recognition) capabilities.

Installation

To install the DataFog library, use the following command:

pip install datafog

Usage

Here are some examples of how to use the DataFog library:

Text PII Annotation

To annotate PII in a given text, let's start with a set of clinical notes (in a notebook, the leading ! runs the command in a shell):

!git clone https://gist.github.com/b43b72693226422bac5f083c941ecfdb.git

from datafog import TextPIIAnnotator

text = "John Doe lives at 1234 Elm St, Springfield."
text_annotator = TextPIIAnnotator()
annotated_text = text_annotator.run(text)
print(annotated_text)

This will output the annotated text with PII labeled, such as {"LOC": ["Springfield"]}.
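Once you have the annotation result, a common next step is to redact the labeled strings from the original text. The sketch below assumes the result is a plain dict mapping entity labels to lists of matched strings, as in the example output above. `redact` is a hypothetical helper written for illustration and is not part of the DataFog API; the sample annotations are hand-written rather than real library output.

```python
def redact(text, annotations):
    """Replace every annotated string in `text` with a [LABEL] placeholder.

    Assumes `annotations` looks like {"LOC": ["Springfield"]}.
    Illustrative helper only; not part of DataFog.
    """
    # Flatten to (entity, label) pairs, skipping empty strings.
    pairs = [
        (entity, label)
        for label, entities in annotations.items()
        for entity in entities
        if entity
    ]
    # Replace longer entities first so a short match never splits a longer one.
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    for entity, label in pairs:
        text = text.replace(entity, f"[{label}]")
    return text


text = "John Doe lives at 1234 Elm St, Springfield."
annotations = {"PER": ["John Doe"], "LOC": ["Springfield"]}  # hand-written sample
redacted = redact(text, annotations)
print(redacted)  # [PER] lives at 1234 Elm St, [LOC].
```

Plain string replacement is used here for simplicity; it will also replace an entity string wherever else it happens to occur in the text, which may or may not be what you want.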

Image Text Extraction and Annotation

To extract text from an image and perform PII annotation, you can use the DataFog class:

from datafog import DataFog

image_url = "https://example.com/image.png"
datafog = DataFog()
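# Top-level await works in a notebook or IPython; in a script, call this from inside an async function
annotated_text = await datafog.run_ocr_pipeline([image_url])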
print(annotated_text)

This will download the image, extract the text using OCR, and annotate any PII found in the extracted text.

Text Processing

To process and annotate text using the DataFog pipeline, you can use the DataFog class:

from datafog import DataFog

text = ["Tokyo is the capital of Japan"]
datafog = DataFog()
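# Top-level await works in a notebook or IPython; in a script, call this from inside an async function
annotated_text = await datafog.run_text_pipeline(text)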
print(annotated_text)

This will process the given text and annotate entities such as person names and locations.

For more detailed usage and examples, please refer to the API documentation.

Note: The DataFog pipeline methods shown above (run_ocr_pipeline and run_text_pipeline) are asynchronous, so call them with await from within an async context.
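This matters mostly outside notebooks: a plain Python script has no running event loop, so a bare top-level `await` is a syntax error there. Below is a minimal sketch of the usual pattern. `fake_text_pipeline` is a hypothetical stand-in for a coroutine method such as `run_text_pipeline`, used only so the example runs without DataFog installed; its return value is invented and does not reflect DataFog's output format.

```python
import asyncio


async def fake_text_pipeline(texts):
    """Hypothetical stand-in for an async pipeline method; not DataFog code."""
    await asyncio.sleep(0)  # yield control once, as an I/O-bound call would
    return [{"text": t, "length": len(t)} for t in texts]


async def main():
    # Inside a coroutine, `await` can be used as in the examples above.
    return await fake_text_pipeline(["Tokyo is the capital of Japan"])


# asyncio.run creates an event loop, runs main() to completion, then closes the loop.
results = asyncio.run(main())
print(results)
```

In Jupyter or IPython an event loop is already running, so `await` can be used directly in a cell, and calling `asyncio.run` there raises a RuntimeError.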

v3.1.0

Base case: PII annotation of text files

import requests

from datafog import TextPIIAnnotator

# Download the sample text file
response = requests.get('https://gist.githubusercontent.com/sidmohan0/1aa3ec38b4e6594d3c34b113f2e0962d/raw/42e57146197be0f85a5901cd1dcdd9ad15b31bab/sotu_2023.txt')
response.raise_for_status()  # Ensure the request was successful
text = response.text

# Annotate the text, passing a path for the JSON output
text_annotator = TextPIIAnnotator()
annotated_text = text_annotator.run(text, output_path="sotu_2023_output.json")
print("Annotated Text:", annotated_text)

OCR Reference Set (Images)

image_set = {
    "medical_invoice": "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png",
    "sales_receipt": "https://templates.invoicehome.com/sales-receipt-template-us-classic-white-750px.png",
    "press_release": "https://newsroom.cisco.com/c/dam/r/newsroom/en/us/assets/a/y2023/m09/cisco_splunk_1200x675_v3.png",
    "insurance_claim_scanned_form": "https://www.pdffiller.com/preview/101/35/101035394.png",
    "scanned_internal_record": "https://www.pdffiller.com/preview/435/972/435972694.png",
    "executive_email": "https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg"
}

OCR text extraction from images + PII annotation

With this image set defined, you can run the following steps:

from datafog import OCRPIIAnnotator

# Pick one image from the reference set defined above
image_url = image_set["executive_email"]

# Extract text via OCR and annotate PII, passing a path for the JSON output
annotator = OCRPIIAnnotator()
annotated_text = annotator.run(image_url, output_path="executive_email_output.json")
print("Annotated Text:", annotated_text)

The output should look like this:

Annotated Text: {'DATE_TIME': ['Wednesday', 'June 12, 2019'], 'LOC': [], 'NRP': [], 'ORG': [], 'PER': ['Kevin Scott Sent', 'Satya Nadella', 'Bill Gates Subject', 'Thoughts']}
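The result above contains several empty categories, plus entries such as 'Kevin Scott Sent' that look like OCR run-ons, so it can help to summarize a result before reviewing it. The sketch below assumes only that the result is a dict of label to list of strings, as printed above; `summarize` is a hypothetical helper and not part of DataFog.

```python
def summarize(annotations):
    """Return {label: count} for labels with at least one detected entity."""
    return {
        label: len(entities)
        for label, entities in annotations.items()
        if entities
    }


# Sample copied from the output shown above.
annotated_text = {
    "DATE_TIME": ["Wednesday", "June 12, 2019"],
    "LOC": [],
    "NRP": [],
    "ORG": [],
    "PER": ["Kevin Scott Sent", "Satya Nadella", "Bill Gates Subject", "Thoughts"],
}
summary = summarize(annotated_text)
print(summary)  # {'DATE_TIME': 2, 'PER': 4}
```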

With PySpark

Note: as of 3.1.0, you'll need to start the Spark session by instantiating the DataFog class, as shown below:

from datafog import DataFog
from datafog.pii_annotation import ImageProcessor, SparkService

# Instantiate DataFog first (see the note above)
datafog = DataFog()

# Download, then parse, the images from the reference set shared above
processed_images = [(name, ImageProcessor().download_image(url=image_url)) for name, image_url in image_set.items()]
parsed_images = [(name, ImageProcessor().parse_image(img)) for name, img in processed_images]

# Load the (name, parsed data) pairs into a Spark DataFrame
df = SparkService().spark.createDataFrame(parsed_images, ["image_name", "parsed_data"])

# Display DataFrame
df.show(truncate=False)

Contributing

DataFog is a community-driven open-source platform, and we've been fortunate to have a small and growing contributor base. We'd love to hear your ideas, feedback, and suggestions for improvement: anything you think would make DataFog better. Join our Discord to become part of the community.

Dev Notes

Justfile commands:
- just format to apply formatting.
- just lint to check formatting and style.

Testing

To run the datafog unit tests, check out this repository and run:

tox

License

This software is published under the MIT license.

Release history

- 3.2.1b3 (pre-release): May 18, 2024
- 3.2.1b2 (pre-release): May 18, 2024
- 3.2.1b1 (pre-release): May 18, 2024
- 3.2.0 (this version): May 14, 2024
- 3.2.0b20 (pre-release): May 14, 2024
- 3.2.0b16 (pre-release): May 14, 2024
- 3.2.0b15 (pre-release): May 13, 2024
- 3.2.0b14 (pre-release): May 13, 2024
- 3.2.0b13 (pre-release): May 13, 2024
- 3.2.0b12 (pre-release): May 13, 2024
- 3.2.0b11 (pre-release): May 13, 2024
- 3.2.0b10 (pre-release): May 13, 2024
- 3.2.0b9 (pre-release): May 13, 2024
- 3.2.0b8 (pre-release): May 13, 2024
- 3.2.0b7 (pre-release): May 13, 2024
- 3.2.0b6 (pre-release): May 13, 2024
- 3.2.0b5 (pre-release): May 13, 2024
- 3.2.0b4 (pre-release): May 13, 2024
- 3.2.0b3 (pre-release): May 13, 2024
- 3.2.0b2 (pre-release): May 13, 2024
- 3.2.0b1 (pre-release): May 13, 2024
- 3.1.0: May 10, 2024
- 3.1.0b1 (pre-release): May 7, 2024
- 3.0.1: May 6, 2024
- 3.0.1b1 (pre-release): May 6, 2024
- 3.0.0: May 6, 2024
- 3.0.0b6 (pre-release): Apr 29, 2024
- 3.0.0b5 (pre-release): Apr 23, 2024
- 3.0.0b4 (pre-release): Apr 23, 2024
- 3.0.0b3 (pre-release): Apr 22, 2024
- 3.0.0b2 (pre-release): Apr 22, 2024
- 3.0.0b1 (pre-release): Apr 21, 2024
- 2.4.0: Apr 2, 2024
- 2.4.0b4 (pre-release): Apr 2, 2024
- 2.4.0b3 (pre-release): Apr 1, 2024
- 2.4.0b2 (pre-release): Apr 1, 2024
- 2.4.0b1 (pre-release): Apr 1, 2024
- 2.4.0a4 (pre-release): Apr 1, 2024
- 2.4.0a3 (pre-release): Apr 1, 2024
- 2.4.0a2 (pre-release): Apr 1, 2024
- 2.4.0a1 (pre-release): Apr 1, 2024
- 2.3.2: Mar 25, 2024
- 2.3.2b10 (pre-release): Mar 25, 2024
- 2.3.2b9 (pre-release): Mar 25, 2024
- 2.3.2b8 (pre-release): Mar 25, 2024
- 2.3.2b7 (pre-release): Mar 25, 2024
- 2.3.2b6 (pre-release): Mar 25, 2024
- 2.3.2b5 (pre-release): Mar 25, 2024
- 2.3.2b4 (pre-release): Mar 25, 2024
- 2.3.2b3 (pre-release): Mar 25, 2024
- 2.3.2b2 (pre-release): Mar 25, 2024
- 2.3.2b1 (pre-release): Mar 14, 2024
- 2.3.1: Mar 14, 2024
- 2.3.0: Mar 12, 2024
- 2.3.0b3 (pre-release): Mar 12, 2024
- 2.3.0b2 (pre-release): Mar 11, 2024
- 2.3.0b1 (pre-release): Mar 11, 2024
- 2.2.2 (yanked, reason: Unstable): Mar 7, 2024
- 2.2.0: Mar 10, 2024
- 2.2.0b1 (pre-release): Mar 10, 2024
- 2.1.1: Mar 5, 2024
- 2.0.1: Feb 23, 2024
- 2.0.0 (yanked): Feb 21, 2024
- 1.4.0: Feb 16, 2024
- 1.3.8: Jun 22, 2023
- 1.3.7: Jun 22, 2023
- 1.3.6: Jun 22, 2023
- 1.3.5: Jun 22, 2023
- 1.3.4: Jun 22, 2023
- 1.3.3: Jun 22, 2023
- 1.3.2: Jun 22, 2023
- 1.3.1: Jun 22, 2023
- 1.3.0: Jun 22, 2023
- 1.2.0: Jun 22, 2023
- 1.1.0: Jun 22, 2023
- 1.0: Jun 21, 2023

Download files

Source Distribution

datafog-3.2.0.tar.gz (15.3 kB), uploaded May 14, 2024

Hashes for datafog-3.2.0.tar.gz
Algorithm	Hash digest
SHA256	`074fd853b4384d1bcac817135ebba95c103e17265c5d830dd86ca9826cfab0c1`
MD5	`254752aa07acae55c469741829d9ef80`
BLAKE2b-256	`1d551c857d927b85ea9a5aa6882d48f42f9265b5f3366936fb0f060bd3059cd7`
