
PII data file #828

Open · wants to merge 38 commits into base: dev

Commits (38):
- `3fee793` Update README.md (shahrokhDaijavad, Nov 15, 2024)
- `f727f8a` Update README.md (shahrokhDaijavad, Nov 18, 2024)
- `06f91a3` Update README.md (shahrokhDaijavad, Nov 18, 2024)
- `581c1e9` Update README.md (shahrokhDaijavad, Nov 18, 2024)
- `2df4ada` Update README inweb2parquet (shahrokhDaijavad, Nov 18, 2024)
- `e50ae58` Update README.md for the web2parquet (shahrokhDaijavad, Nov 18, 2024)
- `850d10c` Update README-list.md (shahrokhDaijavad, Nov 18, 2024)
- `f1a5ed3` Update README-list.md (shahrokhDaijavad, Nov 18, 2024)
- `040b9d2` Update README.md (Padarn, Nov 16, 2024)
- `49b4022` Update README.md (shahrokhDaijavad, Nov 18, 2024)
- `cc27ad3` Create test (PoojaHolkar, Nov 19, 2024)
- `b66e0de` PII input file (PoojaHolkar, Nov 24, 2024)
- `269131b` PII_redactor code example (PoojaHolkar, Nov 24, 2024)
- `299aba3` invoice data (PoojaHolkar, Nov 25, 2024)
- `cf5fca8` upload data (PoojaHolkar, Nov 25, 2024)
- `96bc729` upload data (PoojaHolkar, Nov 25, 2024)
- `5355684` Delete examples/notebooks/PII/Invoice.pdf (PoojaHolkar, Nov 25, 2024)
- `5d739b9` Delete examples/notebooks/PII/invoicedata/test.py (PoojaHolkar, Nov 25, 2024)
- `006038a` notebook recipe for PII redaction code (PoojaHolkar, Nov 25, 2024)
- `2dd4c6d` update pdf2parquet README (dolfim-ibm, Nov 13, 2024)
- `5be3176` add data_files_to_use (dolfim-ibm, Nov 13, 2024)
- `13a90ce` doc_chunk README (dolfim-ibm, Nov 13, 2024)
- `a78729c` text_encoder README (dolfim-ibm, Nov 13, 2024)
- `e19d0d6` Added notebook for pdf2parquet (Nov 20, 2024)
- `0de3005` Added doc chunk minimal notebook (touma-I, Nov 20, 2024)
- `51c3b3f` Update pdf2parquet.ipynb (shahrokhDaijavad, Nov 20, 2024)
- `fc3d134` Update pdf2parquet.ipynb (shahrokhDaijavad, Nov 20, 2024)
- `0e5e1ea` minimal sample notebook for how transform can be invoked (touma-I, Nov 20, 2024)
- `7941a91` restoring the make venv (shahrokhDaijavad, Nov 20, 2024)
- `1eab380` unification of notebooks (shahrokhDaijavad, Nov 20, 2024)
- `b7d34ce` added constraint for pydantic to prevent llama-index-core from picki… (touma-I, Nov 22, 2024)
- `b401b70` updated README file and added a sample notebook (Nov 19, 2024)
- `8fe07f4` removed python code in README and minor changes in the notebook (Nov 19, 2024)
- `51f3f78` updated with relative path and added markdown for notebook (Nov 21, 2024)
- `5746aee` Update web2parquet.ipynb (shahrokhDaijavad, Nov 22, 2024)
- `fc84470` Update Run_your_first_PII_redactor_transform.ipynb (PoojaHolkar, Nov 25, 2024)
- `8238395` updated code (pholkar1, Nov 27, 2024)
- `7282f1a` Delete examples/notebooks/PII/test (PoojaHolkar, Nov 27, 2024)
22 changes: 15 additions & 7 deletions README.md
@@ -122,7 +122,14 @@ Explore more examples [here](examples/notebooks).

### Run your first data prep pipeline

Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for an end to end use case like fine tuning model or building a RAG application. This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of how to build an end to end data prep pipeline for fine tuning for code LLMs. You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).
Now that you have run a single transform, the next step is to explore how to put these transforms
together to run a data prep pipeline for an end-to-end use case like fine-tuning a model or building
a RAG application.
This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of
how to build an end-to-end data prep pipeline for fine-tuning code LLMs. Similarly, this
[notebook](examples/notebooks/fine%20tuning/language/demo_with_launcher.ipynb) demonstrates an
end-to-end data prep pipeline for fine-tuning on language datasets.
You can also explore how to build a RAG pipeline [here](examples/notebooks/rag).

### Current list of transforms
The matrix below shows the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder.
@@ -133,7 +140,8 @@
| **Data Ingestion** | | | | |
| [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | |
| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Web to Parquet](transforms/universal/web2parquet/README.md) | :white_check_mark: | | | |
| **Universal (Code & Language)** | | | | |
| [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: |
| [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: |
@@ -223,11 +231,11 @@ If you use Data Prep Kit in your research, please cite our paper:
@misc{wood2024dataprepkitgettingdataready,
title={Data-Prep-Kit: getting your data ready for LLM application development},
author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh
and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel
and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai
and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran
and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi
and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and Shahrokh Daijavad},
and Constantin Adam and Abdulhamid Adebayo and Sungeun An and Yuan Chi Chang
and Xuan-Hong Dang and Nirmit Desai and Michele Dolfi and Hajar Emami-Gohari
and Revital Eres and Takuya Goto and Dhiraj Joshi and Yan Koyfman
and Mohammad Nassar and Hima Patel and Paramesvaran Selvam and Yousaf Shah
and Saptha Surendran and Daiki Tsuzuku and Petros Zerfos and Shahrokh Daijavad},
year={2024},
eprint={2409.18164},
archivePrefix={arXiv},
Binary file added examples/notebooks/Input-Test-Data/Invoice.pdf
Binary file not shown.
353 changes: 353 additions & 0 deletions examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb
@@ -0,0 +1,353 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting Text from PDF and Configuring PII Redactor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**Author**: Pooja Holkar,\n",
"**email**:[email protected]\n",
"\n",
"\n",
"\n",
"\n",
"### What is a PII Redactor?\n",
"\n",
"A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n",
"\n",
"- Names\n",
"- Email addresses\n",
"- Phone numbers\n",
"- Addresses\n",
"- Financial details (e.g., credit card numbers)\n",
"\n",
"### Overview of the use case\n",
"In this use case, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.\n",
"\n",
" **Workflow Overview**\n",
"\n",
"The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n",
"\n",
" **Redactor Configuration**\n",
"\n",
"The system is configured to recognize specific PII entities relevant to invoices, such as:\n",
"\n",
"- Customer names\n",
"- Email addresses\n",
"- Phone numbers\n",
"- Shipping addresses\n",
"\n",
" **PII Detection and Redaction**\n",
"\n",
"The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.\n",
"\n",
" **Output**\n",
"\n",
"The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.\n",
"\n",
"### Why is PII Redaction Important?\n",
"\n",
" **Data Privacy Compliance**: Adheres to regulations like GDPR, HIPAA, or CCPA that mandate safeguarding customer information.\n",
"\n",
" **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.\n",
"\n",
" **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pre-req: Install data-prep-kit dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# !pip install transforms\n",
"# !pip install pdfplumber\n",
"# !pip install flair\n",
"# !pip install spacy\n",
"# !pip install presidio_anonymizer==2.2.355"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pdfplumber\n",
"from pii_redactor_transform import PIIRedactorTransform\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: Inspect the Data \n",
"\n",
"We will use a simple invoice PDF:\n",
"\n",
"[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"!wget 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"pdf_path=\"Invoice.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Extract Text from PDF\n",
"\n",
"This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"with pdfplumber.open(pdf_path) as pdf:\n",
"    # extract_text() can return None on image-only pages, so substitute \"\"\n",
"    text = \"\\n\".join((page.extract_text() or \"\") for page in pdf.pages)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Configure the PII Redactor\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"This configuration defines the parameters for identifying and redacting Personally Identifiable Information (PII) in the extracted text."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"\n",
"config = {\n",
" \"entities\": [\"PERSON\", \"EMAIL_ADDRESS\", \"PHONE_NUMBER\", \"LOCATION\"],\n",
" \"operator\": \"replace\",\n",
" \"transformed_contents\": \"redacted_contents\",\n",
" \"score_threshold\": 0.6\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4: Initialize and Run the PII Redactor\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step initializes the PII Redactor using the previously defined configuration and prepares it for processing the extracted text."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting en-core-web-sm==3.8.0\n",
" Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)\n",
"\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m9.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m[31m9.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n",
"\u001b[?25hInstalling collected packages: en-core-web-sm\n",
"Successfully installed en-core-web-sm-3.8.0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"17:45:46 INFO - Loading model from flair/ner-english-large\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
"You can now load the package via spacy.load('en_core_web_sm')\n",
"\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n",
"If you are in a Jupyter or Colab notebook, you may need to restart Python in\n",
"order to load all the package's dependencies. You can do this by selecting the\n",
"'Restart kernel' or 'Restart runtime' option.\n",
"2024-11-25 17:46:04,004 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>\n"
]
}
],
"source": [
"\n",
"redactor = PIIRedactorTransform(config)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 5: Apply the Redactor to Text Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"\n",
"redacted_text, detected_entities = redactor._redact_pii(text)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 6: Display the Redaction Results\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Redacted Text:\n",
" INVOICE\n",
"Apple Inc.\n",
"Invoice Details:\n",
"Invoice Number: INV-2024-001\n",
"Invoice Date: November 15, 2024\n",
"Due Date: November 30, 2024\n",
"Billing Information:\n",
"Customer Name: <PERSON>\n",
"Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
"Email: <EMAIL_ADDRESS>\n",
"Phone: <PHONE_NUMBER>\n",
"Shipping Information:\n",
"Recipient Name: <PERSON>\n",
"Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
"Item Details:\n",
"Description Quantity Unit Price Total\n",
"MacBook Air (13-inch, M2) 1 $999.00 $999.00\n",
"AppleCare+ for MacBook Air 1 $199.00 $199.00\n",
"Subtotal: $1,198.00\n",
"Tax (8%): $95.84\n",
"Total Amount Due: $1,293.84\n",
"Payment Method: Credit Card (Visa)\n",
"Transaction ID: 9876543210ABCDE\n",
"Notes:\n",
"Thank you for your purchase!\n",
"For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.\n",
"Detected Entities:\n",
" ['PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PHONE_NUMBER']\n"
]
}
],
"source": [
"# Step 6: Print the Results\n",
"print(\"Redacted Text:\\n\", redacted_text)\n",
"print(\"Detected Entities:\\n\", detected_entities)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>\n",
"\n",
"### This notebook demonstrates how to apply PII redaction to text extracted from a PDF"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Binary file added examples/notebooks/PII/invoicedata/Invoice.pdf
Binary file not shown.