
PII data file #828

Open · wants to merge 38 commits into base: dev

Commits (38):
- `3fee793` Update README.md (shahrokhDaijavad, Nov 15, 2024)
- `f727f8a` Update README.md (shahrokhDaijavad, Nov 18, 2024)
- `06f91a3` Update README.md (shahrokhDaijavad, Nov 18, 2024)
- `581c1e9` Update README.md (shahrokhDaijavad, Nov 18, 2024)
- `2df4ada` Update README inweb2parquet (shahrokhDaijavad, Nov 18, 2024)
- `e50ae58` Update README.md for the web2parquet (shahrokhDaijavad, Nov 18, 2024)
- `850d10c` Update README-list.md (shahrokhDaijavad, Nov 18, 2024)
- `f1a5ed3` Update README-list.md (shahrokhDaijavad, Nov 18, 2024)
- `040b9d2` Update README.md (Padarn, Nov 16, 2024)
- `49b4022` Update README.md (shahrokhDaijavad, Nov 18, 2024)
- `cc27ad3` Create test (PoojaHolkar, Nov 19, 2024)
- `b66e0de` PII input file (PoojaHolkar, Nov 24, 2024)
- `269131b` PII_redactor code example (PoojaHolkar, Nov 24, 2024)
- `299aba3` invoice data (PoojaHolkar, Nov 25, 2024)
- `cf5fca8` upload data (PoojaHolkar, Nov 25, 2024)
- `96bc729` upload data (PoojaHolkar, Nov 25, 2024)
- `5355684` Delete examples/notebooks/PII/Invoice.pdf (PoojaHolkar, Nov 25, 2024)
- `5d739b9` Delete examples/notebooks/PII/invoicedata/test.py (PoojaHolkar, Nov 25, 2024)
- `006038a` notebook recipe for PII redaction code (PoojaHolkar, Nov 25, 2024)
- `2dd4c6d` update pdf2parquet README (dolfim-ibm, Nov 13, 2024)
- `5be3176` add data_files_to_use (dolfim-ibm, Nov 13, 2024)
- `13a90ce` doc_chunk README (dolfim-ibm, Nov 13, 2024)
- `a78729c` text_encoder README (dolfim-ibm, Nov 13, 2024)
- `e19d0d6` Added notebook for pdf2parquet (Nov 20, 2024)
- `0de3005` Added doc chunk minimal notebook (touma-I, Nov 20, 2024)
- `51c3b3f` Update pdf2parquet.ipynb (shahrokhDaijavad, Nov 20, 2024)
- `fc3d134` Update pdf2parquet.ipynb (shahrokhDaijavad, Nov 20, 2024)
- `0e5e1ea` minimal sample notebook for how transform can be invoked (touma-I, Nov 20, 2024)
- `7941a91` restoring the make venv (shahrokhDaijavad, Nov 20, 2024)
- `1eab380` unification of notebooks (shahrokhDaijavad, Nov 20, 2024)
- `b7d34ce` added constraint for pydantic to prevent llama-index-core from picki… (touma-I, Nov 22, 2024)
- `b401b70` updated README file and added a sample notebook (Nov 19, 2024)
- `8fe07f4` removed python code in README and minor changes in the notebook (Nov 19, 2024)
- `51f3f78` updated with relative path and added markdown for notebook (Nov 21, 2024)
- `5746aee` Update web2parquet.ipynb (shahrokhDaijavad, Nov 22, 2024)
- `fc84470` Update Run_your_first_PII_redactor_transform.ipynb (PoojaHolkar, Nov 25, 2024)
- `8238395` updated code (pholkar1, Nov 27, 2024)
- `7282f1a` Delete examples/notebooks/PII/test (PoojaHolkar, Nov 27, 2024)
22 changes: 15 additions & 7 deletions README.md
@@ -122,7 +122,14 @@ Explore more examples [here](examples/notebooks).

### Run your first data prep pipeline

Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for an end to end use case like fine tuning model or building a RAG application. This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of how to build an end to end data prep pipeline for fine tuning for code LLMs. You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
Now that you have run a single transform, the next step is to explore how to put these transforms
together to run a data prep pipeline for an end-to-end use case like fine-tuning a model or building
a RAG application.
This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of
how to build an end-to-end data prep pipeline for fine-tuning code LLMs. Similarly, this
[notebook](examples/notebooks/fine%20tuning/language/demo_with_launcher.ipynb) demonstrates an
end-to-end data prep pipeline for fine-tuning on language datasets.
You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).

### Current list of transforms
The matrix below shows the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder.
@@ -133,7 +140,8 @@
| **Data Ingestion** | | | | |
| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | |
| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Web to Parquet](transforms/universal/web2parquet/README.md) | :white_check_mark: | | | |
| **Universal (Code & Language)** | | | | |
| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |
@@ -223,11 +231,11 @@ If you use Data Prep Kit in your research, please cite our paper:
@misc{wood2024dataprepkitgettingdataready,
title={Data-Prep-Kit: getting your data ready for LLM application development},
author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh
and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel
and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai
and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran
and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi
and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and Shahrokh Daijavad},
and Constantin Adam and Abdulhamid Adebayo and Sungeun An and Yuan Chi Chang
and Xuan-Hong Dang and Nirmit Desai and Michele Dolfi and Hajar Emami-Gohari
and Revital Eres and Takuya Goto and Dhiraj Joshi and Yan Koyfman
and Mohammad Nassar and Hima Patel and Paramesvaran Selvam and Yousaf Shah
and Saptha Surendran and Daiki Tsuzuku and Petros Zerfos and Shahrokh Daijavad},
year={2024},
eprint={2409.18164},
archivePrefix={arXiv},
Binary file added examples/notebooks/Input-Test-Data/Invoice.pdf
Binary file not shown.
353 changes: 353 additions & 0 deletions examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb
@@ -0,0 +1,353 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting Text from PDF and Configuring PII Redactor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**Author**: Pooja Holkar,\n",
"**email**:[email protected]\n",
"\n",
"\n",
"\n",
"\n",
"### What is a PII Redactor?\n",
"\n",
"A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n",
"\n",
"- Names\n",
"- Email addresses\n",
"- Phone numbers\n",
"- Addresses\n",
"- Financial details (e.g., credit card numbers)\n",
"\n",
"### Overview of the use case\n",
"In this use case, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.\n",
"\n",
" **Workflow Overview**\n",
"\n",
"The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n",
"\n",
" **Redactor Configuration**\n",
"\n",
"The system is configured to recognize specific PII entities relevant to invoices, such as:\n",
"\n",
"- Customer names\n",
"- Email addresses\n",
"- Phone numbers\n",
"- Shipping addresses\n",
"\n",
" **PII Detection and Redaction**\n",
"\n",
"The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.\n",
"\n",
" **Output**\n",
"\n",
"The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.\n",
"\n",
"### Why is PII Redaction Important?\n",
"\n",
" **Data Privacy Compliance**: Adheres to regulations like GDPR, HIPAA, or CCPA that mandate safeguarding customer information.\n",
"\n",
" **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.\n",
"\n",
" **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pre-req: Install data-prep-kit dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# !pip install transforms\n",
"# !pip install pdfplumber\n",
"# !pip install flair\n",
"# !pip install spacy\n",
"# !pip install presidio_anonymizer==2.2.355"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pdfplumber\n",
"from pii_redactor_transform import PIIRedactorTransform\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: Inspect the Data \n",
"\n",
"We will use a simple invoice PDF:\n",
"\n",
"[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"!wget 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pdf_path=\"Invoice.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Extract Text from PDF\n",
"\n",
"This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"with pdfplumber.open(pdf_path) as pdf:\n",
"    # extract_text() can return None on image-only pages, so substitute \"\"\n",
"    text = \"\\n\".join((page.extract_text() or \"\") for page in pdf.pages)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Configure the PII Redactor\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"This configuration defines the parameters for identifying and redacting Personally Identifiable Information (PII) in the extracted text."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"\n",
"config = {\n",
" \"entities\": [\"PERSON\", \"EMAIL_ADDRESS\", \"PHONE_NUMBER\", \"LOCATION\"],\n",
" \"operator\": \"replace\",\n",
" \"transformed_contents\": \"redacted_contents\",\n",
" \"score_threshold\": 0.6\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4: Initialize and Run the PII Redactor\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step initializes the PII Redactor using the previously defined configuration and prepares it for processing the extracted text."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting en-core-web-sm==3.8.0\n",
" Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)\n",
"\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m9.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m[31m9.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n",
"\u001b[?25hInstalling collected packages: en-core-web-sm\n",
"Successfully installed en-core-web-sm-3.8.0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"17:45:46 INFO - Loading model from flair/ner-english-large\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
"You can now load the package via spacy.load('en_core_web_sm')\n",
"\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n",
"If you are in a Jupyter or Colab notebook, you may need to restart Python in\n",
"order to load all the package's dependencies. You can do this by selecting the\n",
"'Restart kernel' or 'Restart runtime' option.\n",
"2024-11-25 17:46:04,004 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>\n"
]
}
],
"source": [
"\n",
"redactor = PIIRedactorTransform(config)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 5: Apply the Redactor to Text Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"\n",
"redacted_text, detected_entities = redactor._redact_pii(text)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 6: Display the Redaction Results\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Redacted Text:\n",
" INVOICE\n",
"Apple Inc.\n",
"Invoice Details:\n",
"Invoice Number: INV-2024-001\n",
"Invoice Date: November 15, 2024\n",
"Due Date: November 30, 2024\n",
"Billing Information:\n",
"Customer Name: <PERSON>\n",
"Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
"Email: <EMAIL_ADDRESS>\n",
"Phone: <PHONE_NUMBER>\n",
"Shipping Information:\n",
"Recipient Name: <PERSON>\n",
"Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
"Item Details:\n",
"Description Quantity Unit Price Total\n",
"MacBook Air (13-inch, M2) 1 $999.00 $999.00\n",
"AppleCare+ for MacBook Air 1 $199.00 $199.00\n",
"Subtotal: $1,198.00\n",
"Tax (8%): $95.84\n",
"Total Amount Due: $1,293.84\n",
"Payment Method: Credit Card (Visa)\n",
"Transaction ID: 9876543210ABCDE\n",
"Notes:\n",
"Thank you for your purchase!\n",
"For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.\n",
"Detected Entities:\n",
" ['PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PHONE_NUMBER']\n"
]
}
],
"source": [
"# Step 6: Print the Results\n",
"print(\"Redacted Text:\\n\", redacted_text)\n",
"print(\"Detected Entities:\\n\", detected_entities)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>\n",
"\n",
"### This notebook demonstrates how to apply PII redaction to text extracted from a PDF"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Binary file added examples/notebooks/PII/invoicedata/Invoice.pdf
Binary file not shown.