From 3fee79325117032bdc77ea9d5529946589221240 Mon Sep 17 00:00:00 2001 From: SHAHROKH DAIJAVAD Date: Fri, 15 Nov 2024 14:02:08 -0800 Subject: [PATCH 01/38] Update README.md Signed-off-by: Pooja Holkar --- README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 2c1caa04e..88bed2721 100644 --- a/README.md +++ b/README.md @@ -133,7 +133,7 @@ The matrix below shows the the combination of modules and supported runtimes. Al | **Data Ingestion** | | | | | | [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | | [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | -| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | | +| [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | | **Universal (Code & Language)** | | | | | | [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | | [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: | @@ -223,11 +223,11 @@ If you use Data Prep Kit in your research, please cite our paper: @misc{wood2024dataprepkitgettingdataready, title={Data-Prep-Kit: getting your data ready for LLM application development}, author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh - and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel - and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai - and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran - and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi - and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and 
Shahrokh Daijavad}, + and Constantin Adam and Abdulhamid Adebayo and Sungeun An and Yuan Chi Chang + and Xuan-Hong Dang and Nirmit Desai and Michele Dolfi and Hajar Emami-Gohari + and Revital Eres and Takuya Goto and Dhiraj Joshi and Yan Koyfman + and Mohammad Nassar and Hima Patel and Paramesvaran Selvam and Yousaf Shah + and Saptha Surendran and Daiki Tsuzuku and Petros Zerfos and Shahrokh Daijavad}, year={2024}, eprint={2409.18164}, archivePrefix={arXiv}, From f727f8a4d38b66f78879d82e05937470b862e635 Mon Sep 17 00:00:00 2001 From: SHAHROKH DAIJAVAD Date: Mon, 18 Nov 2024 07:14:38 -0800 Subject: [PATCH 02/38] Update README.md Signed-off-by: Pooja Holkar --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 88bed2721..6b82e7a6a 100644 --- a/README.md +++ b/README.md @@ -134,6 +134,7 @@ The matrix below shows the the combination of modules and supported runtimes. Al | [Code (from zip) to Parquet](transforms/code/code2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | | [PDF to Parquet](transforms/language/pdf2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | | [HTML to Parquet](transforms/language/html2parquet/python/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | +| [Web to Parquet](transforms/universal/web2parquet/README.md) | :white_check_mark: | | | | | **Universal (Code & Language)** | | | | | | [Exact dedup filter](transforms/universal/ededup/ray/README.md) | :white_check_mark: | :white_check_mark: | | :white_check_mark: | | [Fuzzy dedup filter](transforms/universal/fdedup/ray/README.md) | | :white_check_mark: | | :white_check_mark: | From 06f91a38ca5163a625dad5df27e17b13f6a2cdd3 Mon Sep 17 00:00:00 2001 From: SHAHROKH DAIJAVAD Date: Mon, 18 Nov 2024 08:02:57 -0800 Subject: [PATCH 03/38] Update README.md Signed-off-by: Pooja Holkar --- README.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) 
diff --git a/README.md b/README.md index 6b82e7a6a..716f3b0f2 100644 --- a/README.md +++ b/README.md @@ -122,7 +122,14 @@ Explore more examples [here](examples/notebooks). ### Run your first data prep pipeline -Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for an end to end use case like fine tuning model or building a RAG application. This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of how to build an end to end data prep pipeline for fine tuning for code LLMs. You can also explore how to build a RAG pipeline [here](examples/notebooks/rag). +Now that you have run a single transform, the next step is to explore how to put these transforms +together to run a data prep pipeline for an end to end use case like fine tuning a model or building +a RAG application. +This [notebook](examples/notebooks/fine%20tuning/code/sample-notebook.ipynb) gives an example of +how to build an end to end data prep pipeline for fine tuning for code LLMs. Similarly, this +[notebook](examples/notebooks/fine%20tuning/language/demo_with_launcher.ipynb) is a fine tuning +example of an end-to-end sample data pipeline designed for processing language datasets. +You can also explore how to build a RAG pipeline [here](examples/notebooks/rag). ### Current list of transforms The matrix below shows the the combination of modules and supported runtimes. All the modules can be accessed [here](transforms) and can be combined to form data processing pipelines, as shown in the [examples](examples) folder. 
From 581c1e9797c4d537476737e376246a36251acfa7 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 18 Nov 2024 11:25:03 -0800 Subject: [PATCH 04/38] Update README.md transform => transforms Signed-off-by: Pooja Holkar --- transforms/universal/web2parquet/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md index 1841403a7..98af08cc6 100644 --- a/transforms/universal/web2parquet/README.md +++ b/transforms/universal/web2parquet/README.md @@ -30,7 +30,7 @@ pip install data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3 If working from a fork in the git repo, from the root folder of the git repo, do the following: ``` -cd transform/universal/web2parquet +cd transforms/universal/web2parquet make venv source venv/bin/activate pip install -r requirements.txt @@ -49,4 +49,4 @@ Web2Parquet(urls= ['https://thealliance.ai/'], depth=2, downloads=10, folder='downloads').transform() -```` \ No newline at end of file +```` From 2df4ada3c3938c8bdd6fb84bec6c58999bce8391 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 18 Nov 2024 11:37:52 -0800 Subject: [PATCH 05/38] Update README inweb2parquet syntax issues Signed-off-by: Pooja Holkar --- transforms/universal/web2parquet/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md index 98af08cc6..552b22a25 100644 --- a/transforms/universal/web2parquet/README.md +++ b/transforms/universal/web2parquet/README.md @@ -24,7 +24,7 @@ The transform can be installed directly from pypi and has a dependency on the da ``` pip install data-prep-connector pip install data-prep-toolkit>=0.2.2.dev2 -pip install data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3 +pip install 'data-prep-toolkit-transforms[web2parquet]>=0.2.2.dev3' ``` If working from a fork in the git repo, from the root folder of the git repo, 
do the following: From e50ae589d270a7341ccc3dfc4a8c5f6ea6593aef Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 18 Nov 2024 11:52:34 -0800 Subject: [PATCH 06/38] Update README.md for the web2parquet A few syntax changes Signed-off-by: Pooja Holkar --- transforms/universal/web2parquet/README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md index 552b22a25..1a8ecb408 100644 --- a/transforms/universal/web2parquet/README.md +++ b/transforms/universal/web2parquet/README.md @@ -21,6 +21,14 @@ For configuring the crawl, users need to specify the following parameters: The transform can be installed directly from pypi and has a dependency on the data-prep-toolkit and the data-prep-connector +Set up the local environment to run Jupyter notebook: +``` +python -v venv venv +source venv/bin/activate +pip install jupyter lab +``` +Install pre-requisites: + ``` pip install data-prep-connector pip install data-prep-toolkit>=0.2.2.dev2 From 850d10cdf2f988be4e38a7a4ab0614202bf7b8ea Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 18 Nov 2024 12:01:55 -0800 Subject: [PATCH 07/38] Update README-list.md Signed-off-by: Pooja Holkar --- transforms/README-list.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/transforms/README-list.md b/transforms/README-list.md index 3e70b6b62..5567a5576 100644 --- a/transforms/README-list.md +++ b/transforms/README-list.md @@ -36,8 +36,13 @@ Note: This list includes the transforms that were part of the release starting w * [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/python/README.md) * [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/python/README.md) * [web2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/web2parquet/README.md) + +## Release notes: - +### 0.2.2.dev3 +* web2parquet +### 0.2.2.dev2 
+* pdf2parquet now supports HTML,DOCX,PPTX in addition to PDF From f1a5ed34cef7b344982cba978226838f4b1ef654 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 18 Nov 2024 12:02:54 -0800 Subject: [PATCH 08/38] Update README-list.md Signed-off-by: Pooja Holkar --- transforms/README-list.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transforms/README-list.md b/transforms/README-list.md index 5567a5576..8040dc7a9 100644 --- a/transforms/README-list.md +++ b/transforms/README-list.md @@ -42,7 +42,7 @@ Note: This list includes the transforms that were part of the release starting w ### 0.2.2.dev3 * web2parquet ### 0.2.2.dev2 -* pdf2parquet now supports HTML,DOCX,PPTX in addition to PDF +* pdf2parquet now supports HTML,DOCX,PPTX, ... in addition to PDF From 040b9d2d96c1a66abe869f4a05d024c57611f13a Mon Sep 17 00:00:00 2001 From: Padarn Wilson Date: Sat, 16 Nov 2024 14:43:11 +0800 Subject: [PATCH 09/38] Update README.md Signed-off-by: Pooja Holkar --- transforms/universal/ededup/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transforms/universal/ededup/README.md b/transforms/universal/ededup/README.md index 9a112e816..c12b68470 100644 --- a/transforms/universal/ededup/README.md +++ b/transforms/universal/ededup/README.md @@ -1,4 +1,4 @@ -# Exect Deduplification Transform +# Exact Deduplification Transform ## Summary From 49b4022a00dc758944535288590e3544eb6ace78 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Mon, 18 Nov 2024 09:48:42 -0800 Subject: [PATCH 10/38] Update README.md Signed-off-by: Pooja Holkar --- transforms/universal/ededup/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/transforms/universal/ededup/README.md b/transforms/universal/ededup/README.md index c12b68470..0390cc19c 100644 --- a/transforms/universal/ededup/README.md +++ b/transforms/universal/ededup/README.md @@ -1,4 +1,4 @@ -# Exact Deduplification Transform +# Exact Deduplication Transform ## Summary From 
cc27ad37b53548de558965c5046388f1d5847339 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Tue, 19 Nov 2024 15:41:08 +0530 Subject: [PATCH 11/38] Create test Signed-off-by: Pooja Holkar --- examples/notebooks/PII/test | 1 + 1 file changed, 1 insertion(+) create mode 100644 examples/notebooks/PII/test diff --git a/examples/notebooks/PII/test b/examples/notebooks/PII/test new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/examples/notebooks/PII/test @@ -0,0 +1 @@ + From b66e0de528f3fcb253330d6f6232533d4dca5764 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Sun, 24 Nov 2024 20:35:09 +0530 Subject: [PATCH 12/38] PII input file Signed-off-by: Pooja Holkar --- examples/notebooks/Input-Test-Data/Invoice.pdf | Bin 0 -> 33150 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 examples/notebooks/Input-Test-Data/Invoice.pdf diff --git a/examples/notebooks/Input-Test-Data/Invoice.pdf b/examples/notebooks/Input-Test-Data/Invoice.pdf new file mode 100644 index 0000000000000000000000000000000000000000..7b372f7f291e713de14d5b63ffab811c7935b43f GIT binary patch literal 33150 [base85-encoded binary payload omitted] literal 0 HcmV?d00001 From 269131b833c0418fa990385ce1dda01e4cb79810 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Sun, 24 Nov 2024 20:40:56 +0530 Subject: [PATCH 13/38] PII_redactor code example Signed-off-by: Pooja Holkar --- ...un_your_first_PII_redactor_transform.ipynb | 316 ++++++++++++++++++ 1 file changed, 316 insertions(+) 
create mode 100644 examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb diff --git a/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb new file mode 100644 index 000000000..13cdae5f8 --- /dev/null +++ b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb @@ -0,0 +1,316 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Extracting Text from PDF and Configuring PII Redactor" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is a PII Redactor?\n", + "A PII (Personally Identifiable Information) Redactor is a tool or system designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n", + "\n", + "Names\n", + "Email addresses\n", + "Phone numbers\n", + "Physical or shipping addresses\n", + "Financial details (e.g., credit card numbers)\n", + "Use Case in This Project\n", + "In this project, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.\n", + "\n", + "Workflow Overview\n", + "Text Extraction:\n", + "\n", + "The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n", + "Redactor Configuration:\n", + "\n", + "The system is configured to recognize specific PII entities relevant to invoices, such as:\n", + "Customer names\n", + "Email addresses\n", + "Phone numbers\n", + "Shipping addresses\n", + "PII Detection and Redaction:\n", + "\n", + "The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.\n", + "Output:\n", + "\n", + "The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.\n", + "Why is PII Redaction Important?\n", + "Data Privacy Compliance: Adheres to 
regulations like GDPR, HIPAA, or CCPA that mandate safeguarding customer information.\n", + "Risk Mitigation: Prevents unauthorized access to or misuse of sensitive data.\n", + "Automation Benefits: Simplifies and accelerates the process of securing information in large-scale document handling.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "import pdfplumber\n", + "#from data_processing.transform.table_transform import AbstractTableTransform\n", + "#from data_processing.transform import AbstractTableTransform, TransformConfiguration\n", + "from pii_redactor_transform import PIIRedactorTransform\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Step 1: Extract Text from PDF" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "#pdf_path = \"/Users/poojaholkar/GSI/WATSONX/WATSONXDATA/DPK/data-prep-kit-dev/invoicedata/invoice_garminwatch.pdf\" # Replace with the path to your uploaded PDF\n", + "pdf_path=\"/Users/poojaholkar/Downloads/Invoice.pdf\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "ename": "SyntaxError", + "evalue": "invalid syntax (2155885561.py, line 3)", + "output_type": "error", + "traceback": [ + "\u001b[0;36m Cell \u001b[0;32mIn[8], line 3\u001b[0;36m\u001b[0m\n\u001b[0;31m pip install presidio_analyzer\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" + ] + } + ], + "source": [ + "#pip install flair\n", + "#pip install spacy\n", + "#pip install presidio_anonymizer==2.2.355\n", + "#pip install numpy==1.26.4" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Found existing installation: numpy 1.26.4\n", + "Uninstalling numpy-1.26.4:\n", + " Successfully uninstalled 
numpy-1.26.4\n" + ] + } + ], + "source": [ + "!pip uninstall numpy --yes\n", + "#!pip install numpy==1.19.3\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Step 1: Extract Text from PDF\n", + "\n", + "This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "with pdfplumber.open(pdf_path) as pdf:\n", + " text = \"\\n\".join(page.extract_text() for page in pdf.pages)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Step 2: Configure the PII Redactor\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "This configuration defines the parameters for identifying and redacting Personally Identifiable Information (PII) in the extracted text." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "config = {\n", + " \"entities\": [\"PERSON\", \"EMAIL_ADDRESS\", \"PHONE_NUMBER\", \"LOCATION\"],\n", + " \"operator\": \"replace\",\n", + " \"transformed_contents\": \"redacted_contents\",\n", + " \"score_threshold\": 0.6\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Step 3: Initialize and Run the PII Redactor\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This step initializes the PII Redactor using the previously defined configuration and prepares it for processing the extracted text." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "20:33:16 INFO - Loading model from flair/ner-english-large\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2024-11-24 20:33:33,105 SequenceTagger predicts: Dictionary with 20 tags: , O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, , \n" + ] + } + ], + "source": [ + "\n", + "redactor = PIIRedactorTransform(config)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Step 4: Apply the Redactor to Text Data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "redacted_text, detected_entities = redactor._redact_pii(text)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Step 5: Display the Redaction Results\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Redacted Text:\n", + " INVOICE\n", + "Apple Inc.\n", + "Invoice Details:\n", + "Invoice Number: INV-2024-001\n", + "Invoice Date: November 15, 2024\n", + "Due Date: November 30, 2024\n", + "Billing Information:\n", + "Customer Name: \n", + "Address: 123 , Apt 45, , 62704\n", + "Email: \n", + "Phone: \n", + "Shipping Information:\n", + "Recipient Name: \n", + "Address: 123 , Apt 45, , 
62704\n", + "Item Details:\n", + "Description Quantity Unit Price Total\n", + "MacBook Air (13-inch, M2) 1 $999.00 $999.00\n", + "AppleCare+ for MacBook Air 1 $199.00 $199.00\n", + "Subtotal: $1,198.00\n", + "Tax (8%): $95.84\n", + "Total Amount Due: $1,293.84\n", + "Payment Method: Credit Card (Visa)\n", + "Transaction ID: 9876543210ABCDE\n", + "Notes:\n", + "Thank you for your purchase!\n", + "For assistance, please contact our support team at or 1-800-MY-APPLE.\n", + "Detected Entities:\n", + " ['PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PHONE_NUMBER']\n" + ] + } + ], + "source": [ + "# Step 5: Print the Results\n", + "print(\"Redacted Text:\\n\", redacted_text)\n", + "print(\"Detected Entities:\\n\", detected_entities)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "data-prep-kit-1", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 299aba3c562c2f80e259e5654bc3d7dbbe49fe25 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Mon, 25 Nov 2024 17:32:50 +0530 Subject: [PATCH 14/38] invoice data Signed-off-by: Pooja Holkar --- examples/notebooks/PII/Invoice.pdf | Bin 0 -> 33150 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 examples/notebooks/PII/Invoice.pdf diff --git a/examples/notebooks/PII/Invoice.pdf b/examples/notebooks/PII/Invoice.pdf new file mode 100644 index 0000000000000000000000000000000000000000..7b372f7f291e713de14d5b63ffab811c7935b43f GIT binary patch literal 33150 zcmeFZby!u~`UXmOHwdyoy1Tney1Tm@rMp8)K)R&6q&uWVLZnN&K}zld_2@Z!?-RfK 
z+&|89k3O!o=9puS9`E?R_j{+L@*-lijC9O!q&*ua8@q+4S>ru@aLfP(fSsWw91jnG zUd+PU*~Ag}ZEfIeB4T1>XKVtXmoc$5b2bMsGcz&(`1s(QoE=RJY~b8M>$GfQKG&mu zE7v0`l&#)!HΜ*tR6+vNmdelII+ua|1g z6G|(Fk-K>8BXu=OE-&@k8(a4+m6pCK-sdjzUhNeh>?|DK-pk$H%kf<%)0gk~OgHdP z+u5~J?VhR{EE+S-H=8oCsHKn8-JBG+-Hnl1)SniwF>#lMV6hB?N7M+OFT_Xm%pH3= zD2>~Of0}DDb2u>G4_?;NmSJZ<1)0})_UU;_=CV#B91=YHkw&A(3zeWJ9Ha3aF5lqd zLkRiy__S>t$h+PO;fSVRXIov5&MFoXkj;M?BZ8oz+Zppl@xh~cV|WY^Qa-S|8O>`S z%(Lq+t}|7!+Y1@&w1IIam7H2+a+f0wPtSWJb$hPOb8QgvW;ne$8Twkf`@oy1ohSJT zQk>;XYScj?721N`!qjk00Bp`T75f>eOD6?9A+EKY19pOxTp>~+a~s$80qT)mE;{Qc zWhEBbYzb4_JtwWR<4BH$JxJ4Muw>e+L$VV6`UgAu$28N-!v%utb9h+Dz40?OY30R6 zgME`QpqiCSq_)ocvoaqzAe)R`cH0I->nm#lDmi^S-)@CDZ*ehVXuXc9V3h3h}qo2==E~>^9Z-0^B z(++J9T4vtuXCXe?94ssK5*ERxo*A{U>*Je2G05>PBGnNV+G%)25eOwyMtn;r{Vi^J zEpjI1^Ojm-NTeMEarp*R+h~Vz5@CW+W|peb6$h$nfvcrVOhTYb_;nSgnrdpf*G!>~ zc3o8j`emZdN}KQ*)G{;eCGL}JPprntzWp0N)x9TC=+_(*_W37T6iI^2V0-f>?HiX$ z*;XGiF|ciWTG3D;9a}bi$E!E@i0O9IhfH$9FVqZ%PBmLQb1v;oi!Xc8-h^?pS@LmXMPas(@lYF=+v&mPC9>q#L2~i)I;nQ(Vi3MuRMJ%C2vq6qQ5)fmqAxKhpVq=d?QhKAgOS0h!*cbmJ#SE@x^6C75%bdYb|%* z{5|;&I*BoE4Ei)a1SSf`^ElIeR=rpoAAVMsmcMSNOI#Rj;{>Ne>s&|6mPCz4KMyvK zYL*euS)Vm^{*fA^Fy4hgVkYiFKELO^GeK6#Hk7rD^dNuhJ@xfWd}8lrd*{k>afv)) z!;scCtr_+CmxO0R$G71EAL{)*%_OvURiE#Y9LaIIVi2f>S9Vf9&yb=BK4dAXgQ2e7 zj{25&<^5~}(fi25TXIdam+LKhR`eSMHtl)V=_rZUj#XMzIqDTW`GFk@rI!*TueM)= zrOY;ed!9wnCAxPg9YT>vJX4)K(Q45{W}=B@wQX6FE*8Wkf?qao1PL0-eNYT$)o0sN z-`=eOH%A1WMq7^NXH~vqis~L-BC@Kvb2fWIcYGvZ6d`;e97XGv{N%~Jz3P#^k=9{S z*;;mA5^~p5b87`;mcD5AdX-2SL`oaCS_e%_@eWMMTHpN3Jxm7NLyn`jc}yG^k%$;0%2bolaRY_kGN@y1n|&j^Mi1vKV|d7 zmSK?x;UVVr?oz$NDD7jPlRjLpeZA;}6Hz{lF$9%q$HlqZRpO+57WDDy;rO{ZRm*zW zB)k7vbz9mFl2;n;^C+c-`@52FFYlK2?T+S7W&`SJ2~L+j9<#b1n!XWK8nf$D6{ZIZ zmGici8t`@(mWkq|d%JkEXh(Hff6*@txT<`gG1+Du(?2Hrc1xkc;7w}M=(XAuv1hVv zRH@M>=j;A62052v51jl~4}4;7*0IQp33V<>a^A*;PRS&ns!;DO&Rzh5e@8C-v=p`- zEDlti1%D4wj$&LH*GZFTZD8l^bGaKo7o{(h)G0aQV%HMK1@2Q4jMH4!IY(kwWQzgpm7kpTWiwr$(3Rmn0|jK(`= zTcF0C^z~FGh-z*jTn-?9SVqx+{PNli;Y_8#Z?bYFRxc1c4B=@FLcUqPM9iD5r+(H- 
zyYJ;m*U=>X$^8sVA%wWJLZdZ$gbZV2D?5S$oG5M_2)I;HWyXwNWwVs_Emvl`=1hy< z${GzAKh5W-3OM&gZy1PGzc6IQ^LT|a=g!WrAj42ZNL*`yNmTG!O*pX0ga#o)JrLDx zQx78lItg+>9Rj8`k|=lRv+zo6nr2^D*E-WMk6_IUJ33mzzOH8(v5bSu^S~1gv(>>`7J&Ogc#LLAb`<7^Vc}-wS*lRBKVih5^l4?oMtkf>4H) zJBnR~#Kvcv2pX;IjQNnwOhv<|=F3^sA!K5aqOn2X3ODJ<6ZqL84U;b+^5Z-bF9ZIe z#7rPKPdkWv7ZL;Xl#hqX%F~xruOW83?>WXq3H{2DGmO&QJRK0U6hK3A*6{HzSwm#XDn}1R!9)%WGs$DG=^$E?BNpe!~35{Vr;b<5YKyLY@ z9dl}DXKSk)lI3~2Nz;!DJ)3$*eZdr;6fL|DLmiY)_n88NI^8Ugf!;Ze0A_hZsKhAq z5(_n4PVl8Cs>Geif#Us%&&p(JXOU3=I@KFr|0!-r$PXr2zKt+hSrGAv=}>i%z9_de zD9aX!(PEg13RfC9Ued3QhGuKvt41TBg+N-uX~SV8mW6>+FeG8#f(GhOr6T|)nL(Z{ zXB=4YY`BCroUqFf)nDCTu_RDpHIYIvQec#NLxb3Zm}oicOC>~^@!rB~nQSIByo%sj zbi*;@Jr6K1

_Umr$`JaqGCAAXzMm_oVwlNS!*QbkVfb1(%=qY)t@$1mmAm>^yna zlT7ufkU#9cHa0Fuydb8F8!4di(+(jas>)#(G_9M>qkOz6@zM{mj=%1;>P3!R6_Odp7mkkZ`mB9poWqScKGJr&jW*mY@qS8tMu(5IAqecZazr0fEDevw<-~6t5EGd2Tm#Zw zl4bT7eQ(-gWXRL02qxYzg4LCvG)pj??C_8-F_VjI2lAmjv%slt^JCSpAk%!Olysqx ziuqBg2pv{qS?aYBYi9^rQfUkd`6sg&;&!G${G(B!1S*#T&<_yk_NkR1xQwkk!bStn zZX|W%`X-+bg01g~y`fk&FAmh&r6Dj;kh8IAI3b_s@>63j-MhA@9r>-VG2%6id)8~FIFJ%ZOB%+wqi^>HU8$RJCbq^u93SAX z2X~3_H@8XI!yf1-$s3qGema`iIs=%1w-o{OiY88WE{;YfP5_qg3Bq=^&cORlfCsk= znBlpJv4w$Spb}mIe$t79DOc50)czwfyO!$AQ>QU~o=F*St{|0)7D zb=%6DlJZufujb-_&eynko~B)%%xU`GCytn1m;?$0PV^%ik2e|i23UepO)oCfJ;cyi zA|+EWBNJ9OY)&chT2F=bMd2oO^Wf`bvh{2{jZxedmW;?|`NdSymE1^(Z0*;i%VxQr z&?c?QlBpmYmlPj9e0?JW!hn&YH@robZ5_rWih+9&b^-07ZtE=V-f2x{6ARX}QOpaeEn zO63j#ASaL@wTo{~p9Wj&zUo;;Acv3G+p~58&9tJU{=$Bf>2EjG9?W@}^8VmPmAe3& z-|v|uCbNm?MnPeLw8-!<0#fpr@!ciL{aOd7#mHbc&6~^)?Bk#>Dhgi`U<1R%t8R7T99uR1Wu4~7SuJs(jH{W@5MNTtUsGQls1G~2ckVZ zg0Hv$TyUo}1L&fk1&sd@3TBRAaV%CIa*EJ2$`ef(4`D%h4h-;K;jmbgR{~e#WN&e( z0*bSR<$2GZIpe*9(FwTDf*uF3K;40Q1){zN`(S`L3sUbpSILbJBi{9XS;QUl4K#Mg z{EAc~79UbW7xv1NBM_P(y{>o|fdLrs2{Cn;C&wZ}vFaqCs6E zBF3RVa;MXgp(jxroDaUw)1?@~zIO6hSWO5ke`KO00Esn;G728d7R+5BY8Q)w-&;9n z;$Rey9&JO)s_0q47D;?E*?0`eU=nOa63UcXKV?ZvVjVJGvR0z_K`vu)M;r|SUAc0S zOn@FK@+*G%1PU??IVpTmp*dMjsU`sr0T1DPMZ~wZQ|iU9-=%BHwTgI2dC`l+bY-+UK=gpG-b65Qwwq-F%4O z?mZ*?(s~AKLTaM5L9v0qfwxgJqzKi;p#POw5jQ_FKC(G7=9Fc#X>4t5o~^<{_r8TP z@rufp$}^ET(KHb>F{W5X#qJ}<$Mj;GV!0^>OEOCX%Q?%Nk8~wbV})aG8Dr^sEcffT zAv=}t!d^69Tb_6C@b1LprQ)^W;o_;_CA0Lg;-=cB4yJCTZn5;&nrp(>L1=DhZq?~B zB&Hk;E1QzmL~v%8mG2kQPVnU|l<<}ry{%C<%ayGeMb~*nr^PPVpxvP4CL||KuTHO5 zp>UYZr;u5enb$7fuH`e-F%I~oA0e35`8GQ|yWJqeAa}*+S#ey+1XZqPE_WY?QN2U> zG2faNS!GzdqZ^f+=2n*1es z)d$JKrF4x^%2CA8TzCZRBkU?xJ`#5K*`>*41!QCqG( zHa!coymNPrH`K_|$Z_Hk{RxHfTXUN4-^9Bp0C+{i(2K$OCMfy;vsgHVO$gJ*(8gLHxD?gZ^@_9vvD)FaWJ z!U7Yz6?P5cfn`O|!Mt|Dp`&3bCEN38RB`uqcY~lc6crKfX_kl?m>dv~h{BS_auXXA z?-l(h{#^8>XtJnuvO0}M&AM2mW+b417mM%-=1US+$D+&01?iBry;Zoi13G5yq?)}? 
zj&&PF@5jnx^*S{isGdF{Bv?YZ4@2)N+K?$f(`$0ufS8GI4!L=?)knP*b*%H% zIj}l#LL|9IC(KB)ToO}q4NOk3ZrP<;@5FDtC?Pf>my)??qG-e-u=Z>12O&~}@JebO zW?u4E`fiq72H|%@79S5(&~pSP;$$eCG)$XcM}Lz!mxv@uBMs1iud%FNG!r+;98wrY z-8@@QUAKKtFnBt6gHnB=K3x8wv+Y13aKn;x?Fp{N!CrL^WDmM+ir{vz(!*3 z=)KtcD3d2kw&9CY3Rw!3ZTjXlhnsyuO+pGQt+bYUDEJUyIL5&2;6VYzsSFFTXuzU`@FFI?o>U+P(ycaynI=wY-9K&=cJ)o zZLdw$K+8cN*&@a`rk3e#b*of`rTE9foxB}CyqMI@)}5CP`{|$h=V#iNY&qrim;y7+hJV-phzbnS&Blpw{`&;$| zwyvmGeN}4?{dRZD=fW@1OSa{ibDO?Ga1e1Gul9c>Jj_7gnezzuD&8i)%e<%`*9_Op zp8h-?R-eaa@lxjH(zfr1bb?U=TK=?q&$Gyjl+>CbGwHFoQOq}KoSAOkCqjKe#}Q%J zAzW(CeD7-4C)3L-%NO7KC0FPkH#ru~x8B^UB$t?#bLwz=rF*sR2j4|aKW)xP(%bX4 z*seRYY2CK3xl39ri*AGT+Pc=cpUPUgvs5~67;zUWNx7$stnc-m1k=4X*3cu9i+cwVY z=?j}J<1gzU+$;_Y3d0vm+W0U0?y_%;&idoF_vF6HC1ytPyLrOip170Qm~Jjz2AQM)6-FkOzmKStsn5Uq%!1-IqZUfS z;QDrAhmKo7->usQkPI0DGRI@n39gb&wRf*Jq1fOV*UEjNZ26R>+N9~`Z2JSP&$8Mr zI6K+VGG2M)n&!#f+sbRP~cps=K}X-AJ^8WabDms+;43Ffjeyoy9OEg3ky0(l9g=$cqCyse!mzDD zgMi*L%%kt>dYkVDzM>*lz(Kuo_{KBguNHzW0e{9=Af3}Y$R=1_6JZ-DCKu5}6j3Yh zIF9S;KT7kVi;7ya1jMAF10E$Ij~U({41y>lFgXm*!{!y4DsKj#q!oXWU^kUw1$3iv zj|fO)8|(1=eXK|qeONIGp{tx2B@vMzL=KcRqkW*UZWJ~v+#m$7?}%3 zcd(9^#Yu7C-s0yXPKcaEX202b!ibJROvj6i`fx%airB^@f%f-xj_~wyn0j8a&3+-9OT)rC!Lz-;a_OCy_D= zUrvj6?1d)Pgw^SNx;*f?K=@PYIBG+(keTpk8*O$jVvIz(?w2#~G0>D%$vBcqb3mLl z^YlzG-xtYn7$<-BjHX~kVZrc1#YPF~9ub{w;tApinvWGZCle8;InEfeTO*8#1(8`^ zgM{zIO(z(ZI}L5OA`8h+YuWiYqr4bsiE2)y8n2@_x-WGS?eO`vB;pYS%`F6yU50lY zjG&7y?c3*dW*FV&p(nXJzwlck^=j?8qFTP@Z|pdpa6v4%2zpJxXhzX9d zJ;KBUjG%vSG%y^D4g%l|LOZr+*}n9*7WjAa(RW|F}HTQ1sFk z#!f(T=zFUkM;>@q{l+CR|9)J3pA2CGX9H_Hv+tA^kbe4y;d^Y`&*2jUF#nhwAv;H7 zpm;zFnET-`unpf!dz=LTy^@Qe^CM3tV_|Fcz>FCIIV&bsHacb|04w7I;RPHM209i1 zD-$yvko@{Z6g^h(v%Vi?8aZ3o*#Zj%(5om)e*XX-rhh-pj|mUYQP>y&g$;IGPA>NL zc8<;h2KM&WCUi!2Hk5xOoF0bfXBC0~mLGHYf9Vn%D;+BnfQ6a!p+}6IoOEm)02Vfm z-#heJ{Lk9{Nr(P%;8`9Le(#W_ow+TYv7L#4iMxT#KlbU5o9vPG`!x~DjxHwOYkx3g zzTXsl_#;d$=xAYJO)F$)ZTwvMJ6kAX;$-A#Vef3`2*>hs9owncT0Ck*fSl#yb`-O4 zbaECpH*kDd>+~`PKQA}|^lBEy&gKsbk%^rHzz*CEY;5ds?7x0LrvJV!AItnXNIx_v 
zOl-iC{>-HMo#)eu8?o$WKnXm25zbo>qCa<)tB*iRx-9JTb{@JizYcY(I(8#qsqGo* z^Xw2#&t^Oe*TnSQtxu@}bIqB`-F`2O>b5*ITMT#H8C2IHEpoe}6rV?xq2msZa6vz{ znwL{f@~61*H9_*uABoT4@(!(Iub6wnJ-)g7IZvvF<8tKLynv!Irp1B7g7mKLd>p0p zv`yIryTL-&x!bCfi;M(>xOY?@*+`8Q=FKz`jTB|ciDMCj)?J0mJG^(_T&o94n52g8 zj7W!aK!gOM8^MJXpGKoz3cxJzt%AaUx}nAhz_^FHg9#~OF_B(*F$aggZ$>V`I%G8a zWUncRKce2|ctqN9cW=yPu=_eexeB{@3h&pKVzK8z9sA=ZnGG!0ds} ze&mpafYddd7A+ePk7=3Nfut`31CStQU}XdDen}vRSQrV~npv9wR|mbIlhFe)&dI>~ zK;eGBpk;a>ZiNl(B}^>L%$?!bfxI!jlCz18>cbNS?!UV_i6Jmev^2daqo&Ub17y%p{z`p!gUu?h~_Al30 znc{*ph5`n^nd!k^co4U(rNxA`7FnpsL4A(;CiI2kL_jE(wTz{0?d~U2Lp}+4LBH`2 zeQ3&x5U3CuuNpC7cro(iuy~Ct0TkiLLP2b!i~FA~>U^@8oyPikx7-e{7E;<~6Y1=` zSYtuCh{fn<{lSf;n!DA;%+?q|WCGtsjVz0u%SQ$vd;JHG0M` zzLqy+ed$A$XDT}j`)WZM3>zP$s)49Wt+3#d|24lGTg64P zQ^cqBcb}k?Pg{iN>je@} zC-mml-~`>0wYFwjCRmCw;u9E^b>S>HlDu!6s)eCPjTtCL2O~_EXd_aEK{yqu1AcBu zF+Ri;&nG@b347^e^0HinX+brzUR*I#Kgc-*45bd`-jaY#uozus^&Tnd)|bj7Dql(v zC|nww{p9FRHN4@WCt|cRT~$x)dgZPOeXcns?`5bsYpO%tj#w`AZw-z_o%Z_B*smr} zbR5L+1vCsK2Oy6uGQjY9bsh3YH4_vqbS|pa@qB8A=*tw(83g;@rhelno(`^xBporG zAQvocNf@kRmXJiYgq_M~b3U!uR!FL)TF0%?@U~)|pcUYD&;re*HK4%_bF(vHXoR}a5J5OxKe=H@yr_4g}HZYk$nns^`{u3UcgdP;VOsZ)+Q#~fR>WS4Pzq;dHtD`yt3jT=&tU@QEJbhkO z3!iHzj2eqbcRINy#W}Rwi&o!1Zt8~v_~)|zwWohOa{m(PMgIlTf47|Im46}qce@0* zN&Xz^fw2E~NdI?;{`-ma6VZWTqJO#LbYeSgy8$R7hn~GpB41Y9 zB+xt)p|xBc?*77v+6`i#Mn(i#^&Or<3LP=-03L{3jUw zhiUxp!RYUX{4ZmV<#%)RCq}cd19AE1iUA%o|8m7pQ&K%QS=Kk#*EiLd7=2Eo4+7o? 
z0U%AxP)dQ?XaKP&i=67gfXQ4zj%9lf0w=KX+_zRr%vKUIfe_H9TcAxy*U*+q`3?$D25O)bXAF_r3vot(J}W~81q$T_6^n!5J8Nw=T0HhL31iIe zG*S2!I8Du$swkRJ2e#$nuN8Ve-{2VRF+k59y3WDQtLE=cjqI`Xzn<@^C4zu!LzNOr zFjRpCy{75Z7JEyB4E_WhSzZKnAf!nsK@cMkug3&ZNS~$>9215Lt{4nf3<@ilA8r`> z8%HJ6t*N{|Cx|j6L?nWc{GByI7Xm^wV7Yi-Jt5_F?ruV6{OS`hWeAX`?^!wjxXc~{ zy#F`Y5dRmj@xvqd1skm2ZS(&VY&<^uvz(U; z-w@-6lld2h2FLG)2GErRu(ATZ4<;tK$8Vrp`QT{vaAhCr`G1YVoY{G`&jRw@s_N2K6=z*hspeFhow-XEeX4 zoZ>Lxomo^rQ(VXcA^DGOplT~#Z!sM(Y4VLAv>`{(5E|iZO+Nf%RHKN)9S6GTL zjc^b)bmg6%bJ0}g{zhMRz{>q8f_zH=sbCL@>A=Lqa|K|cDXf?_q}Zp>3QPOVGU9Uo zx5B{d5jJeKa%F{NbO%fpe#%vN_Z>4OR zAkXV&=qmH)jWzH`#bYsi1CbI(0RQ9eecWh&+`a$gkobMm{^R)o0J{EctU%WvsPFpW z`ZIF^b!-2*>ks%-*B|bIM*0iaA4sD8rRxu1{Z2^zc0T+rnfqy00%^xz2Lubl1E%~d za;Z!`Oc;G;y4}t6Rihyo$ubxY80>)%iG!#^CnAKheK%yHmx3T=HN_1vAV*FTOG5go<^=L|3Fhuuk4dtnW`41qsNx)_%*a=C%4Jnazdbb z;Izos3h~<*2VVtD9wY+9}C zPCK`oQRx%ltq{&XCtL_PUzWHJ1zGd1(Q>wFx%X)?%wmP2EqK=%BWTD1H(^HTbEZ4{ z>0&^x*2Sbf%@+XEx!|`%=!~DHl8F&o^7VvG z6tdrXG)AI!Ju}4@(q;|!ZwfCGDrcB+nizoY6UIR|_}Ya762N!ryMk;4s?)vr?kSsj zdPl>1z>#Q>KQ#r480_cd&gN5mA}`gM_+gH zV2Y8o`Bxt$mNUIp3R0nn=L=v=9|f3ky-7p3+1F2py9C!i15dYaNb-tBQQ#kn!%N9Q zHHT5`cmmIYkO32mju)*g5ia)ZNo73_qCm7c1zw~Wu~m;?9;^aAiBM}NId2?pwYcj5 zz6LRAwJ4Ku&~+}~TY5(Uv@Bi}9#gP4&@N9@;FQI)iU!bWMLpi)*BRleu7PNWRWP8L z5}^<97`?1{YK^A#-g~UR#*QOtckKDvYLi`4CZuJjrC{6glg=mE`Sw)|vaiW`=t{BM zP*TFRMRr9za$FU}kws%9d2jum2!jiQ_yF=@FcJqyFYG(B1h(`qgtjn1cLV~O;HStG zzHqW1z$mPx1NrZ(>Ow>Z#q%pJAN0KYn^$2%fAPUr;+`oc5Bp0Q>@95+>6P18E@`l;m z2B8DAYZIZuiH-u=Qn$yO7uD6zG8LXQxhF%Utud#mbuEUW+!N1QA=bwYC2GzVO>o43 zqPmv}@{Q|J`1`zhtMp!dHu>KF&-^YqR%Sg@*W(_qU}kTr*)-bk-KvtPIv_ z$80qv4Vl|mYOMG}CvJK5NBb{^2Yve)#xpP!7A z%(%Aj$?_5U_t*mc_NI&+J5en&$4sF8MgYO;_JCKZ5glGqnNE|fE0iyE4sx_ZhW3YT zy45l)AhGRZqCD){n3qY8>)#Yqd(~ycn+%s&qOFO4d*_{>bzw%84t*5s?^OzbN5!I4 z@dSYp#mlml8mXuhMXSP6WwQPja~EjcPx*eOrrT4;OO?4CQSe1K_RQqdxch@K@0@14 z@fWt4*e-8r_{zhz^wGFJ8P&dzcxj~?*X}cW;R&A+fo{Li8{Rj1^S&hM4egwglJFgb z2c$Qbl#NKPMyPtWL^dPq$LpFxr@_;vRBspSq0G6@&^iaUhEiOkiyv3azDhBUbV6nc 
z`;2#j#hS99C{d;N%lq*=QTUUaZdS(}U{`1E`U}s5@bmesXsC;rKE2w#usKod+=;AQ zKHKZaDm|`oqmd{d&$s3sLc*^mKU+=@fd-EVMz9;Q{33*-H_#r64ON(5t{tB;J-1QS zeJY_Sg^9szRIiXyTBH-T=w$Gw|Lt2r12I5GQQ#Oa>E<)tA?XHPa_J+z*f05=sTI#g zoed{8ZnusdDu#*=8E8Y;`=(XNNRn0|R~scR%NnLHtIiO+d2M-nR=w7-Pf~BiWC{4k zqOVS5lpN^lr~-C6@NnM%UoA29g`|n(>Hv3@)Y95gc#1uZYNybgMKB@~0YEPDobu-3 zQ8)is8jCGc!szm}dzR9gQ_RwKs0e!A;-e5zVI}a*)=t|p%2(VDSWHE)k;N0y3`O@v zFoW%x3>vLx6c0%;i8d;7u@WO|ck5ld@uuo-lh%G^^v5h_d=)4}qjz5>bcw4Xbn zNBq`w2USewvC?)WUbTPRO!8$;OHt9oPoYR4?Un9~&{H1Er^P!l z3pJNA(pIp;RL%>>c0JCxARm$iGQzWp0gwqukkn=)>!pgM7*&ziTs2?6F?J{CHyoF= z+SUh5`ClX3Hu}k!C9QV7qd#%%=nF#Wmh=Qj#`%5v~2kW*+*P_?zm+%9X4aL&}7hcUDZ)y76=Esg&5 z&W>m?)Es*p-MJK=8PTyC4(Oeas!pGxwl!OnKPbx++NCq@I(%Eq{{h1E6W;< zqW+WSVndSV8JO^TahGL>L{F4+y z-;+!(Dk9&Q4yeV8^(DjEtd3W!)HF&y`SCD2+^?F-2tPOGMU&+Aa)9sc!!JZm7)~$4 ziqR$eo|`{kjp{|1+eU438iYUL9qc+rBz?a?@Ij(JYq5uAS*B;ENB^y*QWEn|*MKKfxe~B&1wMg~y4asg{zJ_9-GVG~s1AdlX+>gxJ6qUswKi z&}sP4pt)e{mRnkcPjz)eS_5n9;WtCA7_y5FWf?DJJgxWSiuF2 zG4KJt#B@R08g2=aBecf8&@9)W@_Py1dgnS3S;Rr!<&bo}J+e!olAgd^+)U!JY}OB}-ekmebuVQVrph z;UK8lwmj=A1fb9AE<>?p-9fFF4h!W5g*mW#UV_WwTNrVJZRxTJFgSmf(IkBG9?>rg zuu|~3~)VYxl4wmKY!?|H;C=sDRA+YMJ4Jqkt$IpY%tnBeYKU6 zc1ECYwnXjNlSlf5hJAfFeU;zLc#n`^JHBuA1}UHi>1nQMndT>vzTWQ4N>kNj8Z&mvZzum-=DmFweX7rg%x|tFBsIt^K((&x)Vx z#alIYyIc6$@kG zKrDJqOM~RowCaxCd7oQ`Bq+Il_M_4iYdLoJPNv*npHeK%e|Pd)`egtDFKqqCS##DG zbygGB2ixbnhijLq!;6DgF8J0b>hRSCV1%@Ce#xvVOHu1t0rIeH%12xhMTCI)6+`8~ z%YiMbL@ol@D9Y-&No_?Jsg=h8H0f`0Y2b*`Y;I zQRko3^cxp!O8Ps*r$-8(fV)qMk)@cV9aK=Aqg_rSBc@?`B`9Hi99BPENiYvJWg>p2 z?N*Ox9`y#vp%{$IYo!FUt)Uh`RlDn%<8`pQ9a7!i8ewwWzQsc_wrk1EY$=y-HqB(J zUdJ2E!&G~h;a}#kG@UfPyHwAj#BclxMHb2V`PCuV*MwzSPzh=N^#m%|+WJg;qSNuO z>75!gLhQVeda3-1jwECkvlCGhQ>?7@L)*}0)N-sVC)U~3XR^g zuf}=y#dxp=NN*(<&OO3>WwK0S@kBYB>h)Ybg-UDg%HO}_)#*o~RgZPIEm{}J{1O`| zpJF}LQF|)dMwEz10%W4CDvon}z88bGskGcrR3rIB)8t<+5R zm6o^iE#7+7?u!uw%)NYEYdgU?igXKN>DpU%oO{fUQ0PWEZ2~?X+Wo2R52oGtrP9cu zI$%SuS)0@`WKSFsn|XqjVnvFw=EugL3qu}?v3~I)f&~dH8oUCSk(1X}P+VeMPG2U8 
zym)DsnMFT67JlR$ZR(0AyTLhnrW>2t8yTPs`a&SV3+ybi?tSDm9W!fRzvNf;`?`iI z{0ko!l2=N2lm64Rkp|U~X3IXBvugHY2yLC}h)`4Jb@qk&D&tE9ajh{!;z)S5K09QT z9h{ekk>ql?&@zi-_`E6$g4a0TqPA#=-mkRNQ}`(1oRQ}dR8ZfIJl({@TqQiHI`ZLx z9+mndh4T3R+aD>Ee}@2-{%ZpCyH?@1NGp&l{HL0l-wDtMGVus=rPfzcASUqTorAangMxv>9_f~+WnT?NF0Ff0+U zU%1ICQ)Cdp5Uvpbj5r`wbE9v(aJqE~A7iYkk1^JFSdTH**@|he_Np&^KBJ4W|IRVo(cZ3iVeGXevie+@>k{iAyJG2tgR{$$MmhY|liMtJ_hh*rige_}`cz5#zNtbfOjko!xF zVEdt5{)G{2KSHbj7mRp(_V3pKFo5}Qgh~E?rr3HM^`GPPYbyT6-R z?1;a#BY@WRf9PxcK7;=~81db0{S71j=!^ajBmNVN0Q-dzD~x-8f)T%Oz+Z>Yzk?A< ze~l5}UB=(|2;gf>|BR2|1j;e~=4<>EQvo$ja1T1C2NBhS@aa(i^;4MjAclGnP5pNL zkoU*CK>#+OKK0?52SL`OP6}Ao!}srkte?Msim`rP|6b2yTYws?hdhtve6RPrAPZO@ z2M6ch^jAN{Uk~|ymibWD!}UWsk7WSc^(f%_S=jGyc0CT#PrUvChO8VM|Js+77W zKz^3QomM%;KTEJR)}Imv>(ut%O8#DFf$`8y=w-TDIe2sEUcorxIP&&gd1{{|G>#95 z5c8i2y@8AOKRkJU$A$l%ljl1b^4rPtyMy)9gJtCazG3k5T>{{Hng52EvP|7WTdBVx zv+0rylpRgT|A`O`2}OAzilBl+CuUF%OekC<2&gs)Dx`?3yvRg2H}Dcq0V^?-Fj3P+ z)Kbf;>}}nrLY14T>2KC$D@cbj9n)hso{K*8SHn7{qi;B!k`6hY;?bsQgt5$@8KP>U z5mv#a_zbf&Vv6^cX)T_Qj*3d);kMEkTFh|o_%@5PI6Jo*xixZ&_LJv3w?lW|H8dtl zzw1vST=w$v6p>>0aVA_OP!;K+-_1$R%stc@yN^*rXz9z*Ct)}>j>U-*s6FhC;e)rU z<%L5xp{d8anKtwd^2l2?J4d-5-}KS?)|M25_ACO#z3$eGZWul4SsT>Oso`9YzGihzTfcNFZ`NO`?dsM!r0K%0?hBSC;0fy zQOCiyxyRWwW^)t$vt)Hq0;M2l#vRQ~jEFs^S2+3g}Tr%Jg z!g_5SS;Ibuac0iiaDch$Ucmk&^O`>BC0|=GTvj{C!CVJ*=7nII%pt&uE;0&QliYnQ z*o@j7ZkG9AT8N%RI zeiG4B<|w7leVD!?i#C0Ed4_b^om)wd)DvM3%NKmebcaiT-i`BGRkaVQ=EVi3SudnL z=V#pVsF%a7tL}aQ9l|Jd98_PQ55Rov3hS~rXg#VLkYbTdAo)bg6#hwMOys`GzUs9} zoS}k^!^q^2qE*v4(}xq5(qZ|2H2DT2LC3sU3X)O#PJnv|_{$d_K4^C#c!@;Ec3xP~fI)rr`0hsMpph%%{CB+L!i6VBWf4ejFG>U1T z=58o2D+m&P=N!80)eT}^Qm?Ub?|%?#Dl#O=0|{B#6K&z{ zrh6LNiS1yrki=lWkZ$#+gYpuYg$b2_h|3RPph3ldbhkoWlu~W)nJ@X$8eR%=jd${F zRLlm2Py(C``dJAtkF(jQ+pT7g7nL?PCF~`)*P=8aQiPt-6U@gLEBTs1!y&klW2|hG ztWoxZ&`%kJJaFba`f*q)PO)W)8zHVM18@zkENPlpts(aERV90;G|{T0wkgm?BU)W3 zn&>QYzNuch?MkH;ANbsbf5BJPnBiQHF2u*Y#}|Pdt5f0AJ)wS2MTR`> z4Kr%zuOi?bq)3brUJGaoW@P>PIkkEBIFy9WMLYFwuxJJwd|y3~19e=;nEO2j6@h#z 
zfyTD^qDwNfUuPh(>{PaqbvpHBTluH>)H*_T$t5qR-7$-PO{%W{r@bqWhjRP>vSukn z_N{E8nAOZ6OR^=|MJT&$V^{VqWV^0COXSLyvLyQ!$`X+!O9>HSq_Gtdzh~U~S>|@T z_x|^Ny?S2$nUBvo&pGFF=Kaib&ilMSq|8*3rilix946Zq2ieyVO7?mBn@c5 zVYA0`mjw7j;0DHBV(#S%jU2995yWC}A^KGCo{gvbP z=^8xK)YbGvN0tfZGH+!(WeJtfjw?FS5x0oy@MphRC`xncve%8%z2J$10+vV%eoo4V zn&AS(3VA+~3GBoP0o27t(mW-!-3}?&WXkx3^BdG)^{rV|w<_DyZ^vpnk?Zr=t#vU< zjGP#Gw07?4NEK>gvg0Gaq7&2FQi=_)tCaf8hS4GB>6XJqWX80s2v>rPR-y|1KzTp$ zn47Lxz82A~S=H9$+4aM>IauF8MPe4LezP7M@VUKob7R22(K48<27O^+?ncwmJd91n zW63YkKHM#}RrNfn8!I2O6Irq6Q-kwGE|$7T=<3or71&<0J5{Pld{s#pDero=mO18z z+TG=%vVOUuNV#{D=c)7gL@`lk=4&&q@)z@EJAbBHC8;Y*H*w^e%ix@AU~p?jxV~_?$VYp)H$3t3ljisK$9zX>l{Zg@c7#WEo-*Lj9)o-98(G93 z6PjQWH^QR7W_$_>N%+Vjv3yR(nZI$hr@#F=+U3Hhao#l)`)!}sQ*3W;X=}=@oU>Cf zXF0p_%DmPwqTf*ED8)76aXsaZ;+)3$J4@&a1M}Z=7j(_3R;pt#w8*iq97daAK zcLlQ-RT?)Ssx4pa0!d$61s-vVqLY118RN4qI1gJVImMp0_EkYcaYQA#+U0az z1bYDL0iIkDEY4B?WF$6U1O%amazr=9+arZ+7bYxX|KbIaZDL6IybF<57DA`>Iq_Z54;@lCALr9{TK zFShFx;M)>+OKEOu()I}|lB1N{1aJ z_FtYgWXt|>In&fzN)JIXe%1ZGNPtE_$sKcZq-1#1zTC7HY-BEWKOH zw!r2?UTB|~>kBK8{Y)ND{_Y4XsoM_F@ARJueF#teoVm#IU= zi3$iSs6knpd6ngBQPJ$u_EQ1+J8tj7|ux=4dap44f9JROXmC@Hw-I;iP!c z3-~D1wSkwd--l5Iewk0<6>D45DQ<@OmNeIjmcvdV1mLxEr?WYyk5kbsUc_oi+vcHZ zh=hkq1qkTFM(H0vpDbWzwVU)4h9%@y+$$=nW%P}%d&u`f%9bjW@@D8O>{Rcw`Xyp{ zA=#BYr>^c*G`4xgJ)3BtmD3Z;z7Te5iJ$L6+L5eBkq$4lh4tOttvVt*BvZnnj2+!q zO$?i1e1(-sx0cz=2=$QM(>3{3Ari##$#H{@DXj!D_JMD|a!P9+rky$n6CGn<2|Xk$ z^LuxIK($8L4N7~xbLU6B!|?BkXVJ-)OR?m8>0(AN%j3a9G>U-`klbRAXu4d`L%f zFxrESiO5AHvAZNlgCwbf`VJ|{Auf{^VY{9IQB{@H>`1$)#$*8!X3X4$6~hHrW0}bM z!JhtN_d9m);^U3c3*hdfg4J{GbuG@8mYroYz18)x&QV93YGidj8`io{-MGozr0D~l z^oVMvHXsc%+>k#{c?Hzsc`6=4efO@qNr6^ISfAo!=CLQ-!OzY=e>JF~)R822Glbnn z5y`Gs?Lp}7oPkPkOurc8HQt*1j*gIf%MFwIV1Ta8X7TF*0}scMsj~jU!ZQEO4e|71 z5g9VI!~<)WOM4UX z;R#|3h)p}$lV)!>%zDB}!EzA?pMy@C8wwPsiZ&6kjeH%bVtog<<%dGdsU9k!j+K<(4b9lT+?zqUN7~JgHhVj?lyJCKDu=DQ@!c7izSH=~n$1jq$nOI< z+|!tGk$X*lPR%BwS$v64jx$S=FX)5qjloA*t(kq+)8zEF1lMIU)#cj9?#LdVR&Ii)|r`5hlnd6i>B`4#-gDS`}Qs0a5a4 
z3RZ68w-mh}x-tr3u>EAQ-q^Ih_Vm2FkF}Abm~WbODFIW%paPuALGbF;t7AU3?!7z{ zTG?*87Ck**%S#u)$+ezhm zb#uY<1_Vx&+IMWS^w3Gn5w>~DhK(tu*agvNZ+@GNy*6ZDaal*Rkczc@wl{bzs{~Su zeAzh^&K($8YIz;*%!Mu<=xo{$&KBl<>T6H#TdpsqYRvBbS?rCYrNx_4MW3h1WN#-o z$2ktkn&j|uOlVbG=iFa(R(?;HRC9De^0hq{#;Hm=sz#JPf>j~ubbDkFlIcHGDOKRX z&ndjI!I6-&S+#b|eMl=Y{o=*+lq)TY#Z|~8iSgGy&l>|BKC(VQ)UDO5h`scd41G)g zc;J;6slTZ*<4M_Mf{%gJv}*EObm40uwPgy7UJtG;f#5C}hqmr_aT6^uM^s~(8iG^~ z9FV(REfi|CbW&vG2?x5m$D?}X3;XP`%AyW!U75wg?CKh-&Y76uFh}qN<4G2T7XT(Gm zmI$6#QSKJ|SW#v9Ti5h4UjG@!GrTz-yv=uKCKqXmoP_5o!hi!DOTwQxm{N*tAC^BX zEP1`$F{C+AIaqx`IdXPgX^>xRQd-=Knm_q>#it=aI4)YLD{@_G5~cNPQ39oq$O@yhAaE0obMh;IaI7@Zj^Do?-T z6f!&-$)jc&IMF1Z`|R{IR9y1Ir$3&d#?ua+<`p6&3~PbS+23N2Qb=jFeUB-RNM@m_ zDD_*ofhF+Kq2a72k^)aNn;tsRB%6~d(|9jKxbv7NhxTDkWMAdxVS$De&PSfn?;|Mr zd{x;|>ZZqQR|~NbG!JQ3Pbd<2X=@&%(YTi#bIX~mUr|6dkCgH)HrJ)2h#BfK2V3>v zomLvx=co{SRlPYZJ?@n%pG{I3n`>Ph-s`6;B;FY`TEYiCl@tp1Ru>2J&`1`l>pjTm zPXAqJTymLldN!7Zj`ls`voAr&{mL`=EBD({ao<;q|J};H_77L?5#OX1c6Rp=-#7vP zWp{78?Dr=A+THu%?%wvOe;=ovQ2q@AV0Q%nXVzE&K?U3<(9Rm`w&8ofzpt_WVhR0o zme861-`&$C0b~CHBX(0Pb|%(8+1KDT@3jeqay;1iWnhv_4>{aPj!L9|Q^qz8F9ELm+T^ z;y?F8B2fRx4~4`XjoMou6oFF@-)lp{5O{vu9((;DAQbM@+}^T(Z8&k|{Wch&oW9=%N8n@$6Y(~9tFW=-`i_LA_0cnJ{t;# z*gvLVkSJc8!5}D(`L?G%V5IP31O_8;0@Qo`z%bbUaRCFm>HGT&1`^QY*&sMp-oE;v zaQyZm!KnRX3PvDNcx^_)a2o4-+lPYS%|9>_h4;M#qX6vNR~7<-;>8ls$pTUa?7asB zP|U_F3x?ue%J#Gc0tVs52m%(x8v_Uwgf};Ua~rtDoW1oyftkPGhJ=Ig+6T-@y!Igx zxFf22>qEiu<^co+1PRX%i5E*K2&W>xw?3d7c>REaka*t(C>Vmv>F~WRE>3{nzqu0y zF7KtLm4`X-OaT%O*xS2sZ6`a>0JO=`7WP1*gg+UrN@oFCdI^-dg&EWYghYZ(QD9>j v0s(?qK%sCGQv?zYH5WCJr1-hZpNDXrU4RGX_R|ashN8d}yu5O%@)Z9E4yjLZ literal 0 HcmV?d00001 From cf5fca87d97405f70a2629a6cb8340a462c8ddaf Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Mon, 25 Nov 2024 17:33:47 +0530 Subject: [PATCH 15/38] upload data Signed-off-by: Pooja Holkar --- examples/notebooks/PII/invoicedata/test.py | 1 + 1 file changed, 1 insertion(+) create mode 100644 
examples/notebooks/PII/invoicedata/test.py diff --git a/examples/notebooks/PII/invoicedata/test.py b/examples/notebooks/PII/invoicedata/test.py new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ b/examples/notebooks/PII/invoicedata/test.py @@ -0,0 +1 @@ + From 96bc729f4bc83a6f3da0eed5a9becde15a5009e2 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Mon, 25 Nov 2024 17:34:03 +0530 Subject: [PATCH 16/38] upload data Signed-off-by: Pooja Holkar --- examples/notebooks/PII/invoicedata/Invoice.pdf | Bin 0 -> 33150 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 examples/notebooks/PII/invoicedata/Invoice.pdf diff --git a/examples/notebooks/PII/invoicedata/Invoice.pdf b/examples/notebooks/PII/invoicedata/Invoice.pdf new file mode 100644 index 0000000000000000000000000000000000000000..7b372f7f291e713de14d5b63ffab811c7935b43f GIT binary patch literal 33150 zcmeFZby!u~`UXmOHwdyoy1Tney1Tm@rMp8)K)R&6q&uWVLZnN&K}zld_2@Z!?-RfK z+&|89k3O!o=9puS9`E?R_j{+L@*-lijC9O!q&*ua8@q+4S>ru@aLfP(fSsWw91jnG zUd+PU*~Ag}ZEfIeB4T1>XKVtXmoc$5b2bMsGcz&(`1s(QoE=RJY~b8M>$GfQKG&mu zE7v0`l&#)!HΜ*tR6+vNmdelII+ua|1g z6G|(Fk-K>8BXu=OE-&@k8(a4+m6pCK-sdjzUhNeh>?|DK-pk$H%kf<%)0gk~OgHdP z+u5~J?VhR{EE+S-H=8oCsHKn8-JBG+-Hnl1)SniwF>#lMV6hB?N7M+OFT_Xm%pH3= zD2>~Of0}DDb2u>G4_?;NmSJZ<1)0})_UU;_=CV#B91=YHkw&A(3zeWJ9Ha3aF5lqd zLkRiy__S>t$h+PO;fSVRXIov5&MFoXkj;M?BZ8oz+Zppl@xh~cV|WY^Qa-S|8O>`S z%(Lq+t}|7!+Y1@&w1IIam7H2+a+f0wPtSWJb$hPOb8QgvW;ne$8Twkf`@oy1ohSJT zQk>;XYScj?721N`!qjk00Bp`T75f>eOD6?9A+EKY19pOxTp>~+a~s$80qT)mE;{Qc zWhEBbYzb4_JtwWR<4BH$JxJ4Muw>e+L$VV6`UgAu$28N-!v%utb9h+Dz40?OY30R6 zgME`QpqiCSq_)ocvoaqzAe)R`cH0I->nm#lDmi^S-)@CDZ*ehVXuXc9V3h3h}qo2==E~>^9Z-0^B z(++J9T4vtuXCXe?94ssK5*ERxo*A{U>*Je2G05>PBGnNV+G%)25eOwyMtn;r{Vi^J zEpjI1^Ojm-NTeMEarp*R+h~Vz5@CW+W|peb6$h$nfvcrVOhTYb_;nSgnrdpf*G!>~ zc3o8j`emZdN}KQ*)G{;eCGL}JPprntzWp0N)x9TC=+_(*_W37T6iI^2V0-f>?HiX$ 
z*;XGiF|ciWTG3D;9a}bi$E!E@i0O9IhfH$9FVqZ%PBmLQb1v;oi!Xc8-h^?pS@LmXMPas(@lYF=+v&mPC9>q#L2~i)I;nQ(Vi3MuRMJ%C2vq6qQ5)fmqAxKhpVq=d?QhKAgOS0h!*cbmJ#SE@x^6C75%bdYb|%* z{5|;&I*BoE4Ei)a1SSf`^ElIeR=rpoAAVMsmcMSNOI#Rj;{>Ne>s&|6mPCz4KMyvK zYL*euS)Vm^{*fA^Fy4hgVkYiFKELO^GeK6#Hk7rD^dNuhJ@xfWd}8lrd*{k>afv)) z!;scCtr_+CmxO0R$G71EAL{)*%_OvURiE#Y9LaIIVi2f>S9Vf9&yb=BK4dAXgQ2e7 zj{25&<^5~}(fi25TXIdam+LKhR`eSMHtl)V=_rZUj#XMzIqDTW`GFk@rI!*TueM)= zrOY;ed!9wnCAxPg9YT>vJX4)K(Q45{W}=B@wQX6FE*8Wkf?qao1PL0-eNYT$)o0sN z-`=eOH%A1WMq7^NXH~vqis~L-BC@Kvb2fWIcYGvZ6d`;e97XGv{N%~Jz3P#^k=9{S z*;;mA5^~p5b87`;mcD5AdX-2SL`oaCS_e%_@eWMMTHpN3Jxm7NLyn`jc}yG^k%$;0%2bolaRY_kGN@y1n|&j^Mi1vKV|d7 zmSK?x;UVVr?oz$NDD7jPlRjLpeZA;}6Hz{lF$9%q$HlqZRpO+57WDDy;rO{ZRm*zW zB)k7vbz9mFl2;n;^C+c-`@52FFYlK2?T+S7W&`SJ2~L+j9<#b1n!XWK8nf$D6{ZIZ zmGici8t`@(mWkq|d%JkEXh(Hff6*@txT<`gG1+Du(?2Hrc1xkc;7w}M=(XAuv1hVv zRH@M>=j;A62052v51jl~4}4;7*0IQp33V<>a^A*;PRS&ns!;DO&Rzh5e@8C-v=p`- zEDlti1%D4wj$&LH*GZFTZD8l^bGaKo7o{(h)G0aQV%HMK1@2Q4jMH4!IY(kwWQzgpm7kpTWiwr$(3Rmn0|jK(`= zTcF0C^z~FGh-z*jTn-?9SVqx+{PNli;Y_8#Z?bYFRxc1c4B=@FLcUqPM9iD5r+(H- zyYJ;m*U=>X$^8sVA%wWJLZdZ$gbZV2D?5S$oG5M_2)I;HWyXwNWwVs_Emvl`=1hy< z${GzAKh5W-3OM&gZy1PGzc6IQ^LT|a=g!WrAj42ZNL*`yNmTG!O*pX0ga#o)JrLDx zQx78lItg+>9Rj8`k|=lRv+zo6nr2^D*E-WMk6_IUJ33mzzOH8(v5bSu^S~1gv(>>`7J&Ogc#LLAb`<7^Vc}-wS*lRBKVih5^l4?oMtkf>4H) zJBnR~#Kvcv2pX;IjQNnwOhv<|=F3^sA!K5aqOn2X3ODJ<6ZqL84U;b+^5Z-bF9ZIe z#7rPKPdkWv7ZL;Xl#hqX%F~xruOW83?>WXq3H{2DGmO&QJRK0U6hK3A*6{HzSwm#XDn}1R!9)%WGs$DG=^$E?BNpe!~35{Vr;b<5YKyLY@ z9dl}DXKSk)lI3~2Nz;!DJ)3$*eZdr;6fL|DLmiY)_n88NI^8Ugf!;Ze0A_hZsKhAq z5(_n4PVl8Cs>Geif#Us%&&p(JXOU3=I@KFr|0!-r$PXr2zKt+hSrGAv=}>i%z9_de zD9aX!(PEg13RfC9Ued3QhGuKvt41TBg+N-uX~SV8mW6>+FeG8#f(GhOr6T|)nL(Z{ zXB=4YY`BCroUqFf)nDCTu_RDpHIYIvQec#NLxb3Zm}oicOC>~^@!rB~nQSIByo%sj zbi*;@Jr6K1

_Umr$`JaqGCAAXzMm_oVwlNS!*QbkVfb1(%=qY)t@$1mmAm>^yna zlT7ufkU#9cHa0Fuydb8F8!4di(+(jas>)#(G_9M>qkOz6@zM{mj=%1;>P3!R6_Odp7mkkZ`mB9poWqScKGJr&jW*mY@qS8tMu(5IAqecZazr0fEDevw<-~6t5EGd2Tm#Zw zl4bT7eQ(-gWXRL02qxYzg4LCvG)pj??C_8-F_VjI2lAmjv%slt^JCSpAk%!Olysqx ziuqBg2pv{qS?aYBYi9^rQfUkd`6sg&;&!G${G(B!1S*#T&<_yk_NkR1xQwkk!bStn zZX|W%`X-+bg01g~y`fk&FAmh&r6Dj;kh8IAI3b_s@>63j-MhA@9r>-VG2%6id)8~FIFJ%ZOB%+wqi^>HU8$RJCbq^u93SAX z2X~3_H@8XI!yf1-$s3qGema`iIs=%1w-o{OiY88WE{;YfP5_qg3Bq=^&cORlfCsk= znBlpJv4w$Spb}mIe$t79DOc50)czwfyO!$AQ>QU~o=F*St{|0)7D zb=%6DlJZufujb-_&eynko~B)%%xU`GCytn1m;?$0PV^%ik2e|i23UepO)oCfJ;cyi zA|+EWBNJ9OY)&chT2F=bMd2oO^Wf`bvh{2{jZxedmW;?|`NdSymE1^(Z0*;i%VxQr z&?c?QlBpmYmlPj9e0?JW!hn&YH@robZ5_rWih+9&b^-07ZtE=V-f2x{6ARX}QOpaeEn zO63j#ASaL@wTo{~p9Wj&zUo;;Acv3G+p~58&9tJU{=$Bf>2EjG9?W@}^8VmPmAe3& z-|v|uCbNm?MnPeLw8-!<0#fpr@!ciL{aOd7#mHbc&6~^)?Bk#>Dhgi`U<1R%t8R7T99uR1Wu4~7SuJs(jH{W@5MNTtUsGQls1G~2ckVZ zg0Hv$TyUo}1L&fk1&sd@3TBRAaV%CIa*EJ2$`ef(4`D%h4h-;K;jmbgR{~e#WN&e( z0*bSR<$2GZIpe*9(FwTDf*uF3K;40Q1){zN`(S`L3sUbpSILbJBi{9XS;QUl4K#Mg z{EAc~79UbW7xv1NBM_P(y{>o|fdLrs2{Cn;C&wZ}vFaqCs6E zBF3RVa;MXgp(jxroDaUw)1?@~zIO6hSWO5ke`KO00Esn;G728d7R+5BY8Q)w-&;9n z;$Rey9&JO)s_0q47D;?E*?0`eU=nOa63UcXKV?ZvVjVJGvR0z_K`vu)M;r|SUAc0S zOn@FK@+*G%1PU??IVpTmp*dMjsU`sr0T1DPMZ~wZQ|iU9-=%BHwTgI2dC`l+bY-+UK=gpG-b65Qwwq-F%4O z?mZ*?(s~AKLTaM5L9v0qfwxgJqzKi;p#POw5jQ_FKC(G7=9Fc#X>4t5o~^<{_r8TP z@rufp$}^ET(KHb>F{W5X#qJ}<$Mj;GV!0^>OEOCX%Q?%Nk8~wbV})aG8Dr^sEcffT zAv=}t!d^69Tb_6C@b1LprQ)^W;o_;_CA0Lg;-=cB4yJCTZn5;&nrp(>L1=DhZq?~B zB&Hk;E1QzmL~v%8mG2kQPVnU|l<<}ry{%C<%ayGeMb~*nr^PPVpxvP4CL||KuTHO5 zp>UYZr;u5enb$7fuH`e-F%I~oA0e35`8GQ|yWJqeAa}*+S#ey+1XZqPE_WY?QN2U> zG2faNS!GzdqZ^f+=2n*1es z)d$JKrF4x^%2CA8TzCZRBkU?xJ`#5K*`>*41!QCqG( zHa!coymNPrH`K_|$Z_Hk{RxHfTXUN4-^9Bp0C+{i(2K$OCMfy;vsgHVO$gJ*(8gLHxD?gZ^@_9vvD)FaWJ z!U7Yz6?P5cfn`O|!Mt|Dp`&3bCEN38RB`uqcY~lc6crKfX_kl?m>dv~h{BS_auXXA z?-l(h{#^8>XtJnuvO0}M&AM2mW+b417mM%-=1US+$D+&01?iBry;Zoi13G5yq?)}? 
zj&&PF@5jnx^*S{isGdF{Bv?YZ4@2)N+K?$f(`$0ufS8GI4!L=?)knP*b*%H% zIj}l#LL|9IC(KB)ToO}q4NOk3ZrP<;@5FDtC?Pf>my)??qG-e-u=Z>12O&~}@JebO zW?u4E`fiq72H|%@79S5(&~pSP;$$eCG)$XcM}Lz!mxv@uBMs1iud%FNG!r+;98wrY z-8@@QUAKKtFnBt6gHnB=K3x8wv+Y13aKn;x?Fp{N!CrL^WDmM+ir{vz(!*3 z=)KtcD3d2kw&9CY3Rw!3ZTjXlhnsyuO+pGQt+bYUDEJUyIL5&2;6VYzsSFFTXuzU`@FFI?o>U+P(ycaynI=wY-9K&=cJ)o zZLdw$K+8cN*&@a`rk3e#b*of`rTE9foxB}CyqMI@)}5CP`{|$h=V#iNY&qrim;y7+hJV-phzbnS&Blpw{`&;$| zwyvmGeN}4?{dRZD=fW@1OSa{ibDO?Ga1e1Gul9c>Jj_7gnezzuD&8i)%e<%`*9_Op zp8h-?R-eaa@lxjH(zfr1bb?U=TK=?q&$Gyjl+>CbGwHFoQOq}KoSAOkCqjKe#}Q%J zAzW(CeD7-4C)3L-%NO7KC0FPkH#ru~x8B^UB$t?#bLwz=rF*sR2j4|aKW)xP(%bX4 z*seRYY2CK3xl39ri*AGT+Pc=cpUPUgvs5~67;zUWNx7$stnc-m1k=4X*3cu9i+cwVY z=?j}J<1gzU+$;_Y3d0vm+W0U0?y_%;&idoF_vF6HC1ytPyLrOip170Qm~Jjz2AQM)6-FkOzmKStsn5Uq%!1-IqZUfS z;QDrAhmKo7->usQkPI0DGRI@n39gb&wRf*Jq1fOV*UEjNZ26R>+N9~`Z2JSP&$8Mr zI6K+VGG2M)n&!#f+sbRP~cps=K}X-AJ^8WabDms+;43Ffjeyoy9OEg3ky0(l9g=$cqCyse!mzDD zgMi*L%%kt>dYkVDzM>*lz(Kuo_{KBguNHzW0e{9=Af3}Y$R=1_6JZ-DCKu5}6j3Yh zIF9S;KT7kVi;7ya1jMAF10E$Ij~U({41y>lFgXm*!{!y4DsKj#q!oXWU^kUw1$3iv zj|fO)8|(1=eXK|qeONIGp{tx2B@vMzL=KcRqkW*UZWJ~v+#m$7?}%3 zcd(9^#Yu7C-s0yXPKcaEX202b!ibJROvj6i`fx%airB^@f%f-xj_~wyn0j8a&3+-9OT)rC!Lz-;a_OCy_D= zUrvj6?1d)Pgw^SNx;*f?K=@PYIBG+(keTpk8*O$jVvIz(?w2#~G0>D%$vBcqb3mLl z^YlzG-xtYn7$<-BjHX~kVZrc1#YPF~9ub{w;tApinvWGZCle8;InEfeTO*8#1(8`^ zgM{zIO(z(ZI}L5OA`8h+YuWiYqr4bsiE2)y8n2@_x-WGS?eO`vB;pYS%`F6yU50lY zjG&7y?c3*dW*FV&p(nXJzwlck^=j?8qFTP@Z|pdpa6v4%2zpJxXhzX9d zJ;KBUjG%vSG%y^D4g%l|LOZr+*}n9*7WjAa(RW|F}HTQ1sFk z#!f(T=zFUkM;>@q{l+CR|9)J3pA2CGX9H_Hv+tA^kbe4y;d^Y`&*2jUF#nhwAv;H7 zpm;zFnET-`unpf!dz=LTy^@Qe^CM3tV_|Fcz>FCIIV&bsHacb|04w7I;RPHM209i1 zD-$yvko@{Z6g^h(v%Vi?8aZ3o*#Zj%(5om)e*XX-rhh-pj|mUYQP>y&g$;IGPA>NL zc8<;h2KM&WCUi!2Hk5xOoF0bfXBC0~mLGHYf9Vn%D;+BnfQ6a!p+}6IoOEm)02Vfm z-#heJ{Lk9{Nr(P%;8`9Le(#W_ow+TYv7L#4iMxT#KlbU5o9vPG`!x~DjxHwOYkx3g zzTXsl_#;d$=xAYJO)F$)ZTwvMJ6kAX;$-A#Vef3`2*>hs9owncT0Ck*fSl#yb`-O4 zbaECpH*kDd>+~`PKQA}|^lBEy&gKsbk%^rHzz*CEY;5ds?7x0LrvJV!AItnXNIx_v 
zOl-iC{>-HMo#)eu8?o$WKnXm25zbo>qCa<)tB*iRx-9JTb{@JizYcY(I(8#qsqGo* z^Xw2#&t^Oe*TnSQtxu@}bIqB`-F`2O>b5*ITMT#H8C2IHEpoe}6rV?xq2msZa6vz{ znwL{f@~61*H9_*uABoT4@(!(Iub6wnJ-)g7IZvvF<8tKLynv!Irp1B7g7mKLd>p0p zv`yIryTL-&x!bCfi;M(>xOY?@*+`8Q=FKz`jTB|ciDMCj)?J0mJG^(_T&o94n52g8 zj7W!aK!gOM8^MJXpGKoz3cxJzt%AaUx}nAhz_^FHg9#~OF_B(*F$aggZ$>V`I%G8a zWUncRKce2|ctqN9cW=yPu=_eexeB{@3h&pKVzK8z9sA=ZnGG!0ds} ze&mpafYddd7A+ePk7=3Nfut`31CStQU}XdDen}vRSQrV~npv9wR|mbIlhFe)&dI>~ zK;eGBpk;a>ZiNl(B}^>L%$?!bfxI!jlCz18>cbNS?!UV_i6Jmev^2daqo&Ub17y%p{z`p!gUu?h~_Al30 znc{*ph5`n^nd!k^co4U(rNxA`7FnpsL4A(;CiI2kL_jE(wTz{0?d~U2Lp}+4LBH`2 zeQ3&x5U3CuuNpC7cro(iuy~Ct0TkiLLP2b!i~FA~>U^@8oyPikx7-e{7E;<~6Y1=` zSYtuCh{fn<{lSf;n!DA;%+?q|WCGtsjVz0u%SQ$vd;JHG0M` zzLqy+ed$A$XDT}j`)WZM3>zP$s)49Wt+3#d|24lGTg64P zQ^cqBcb}k?Pg{iN>je@} zC-mml-~`>0wYFwjCRmCw;u9E^b>S>HlDu!6s)eCPjTtCL2O~_EXd_aEK{yqu1AcBu zF+Ri;&nG@b347^e^0HinX+brzUR*I#Kgc-*45bd`-jaY#uozus^&Tnd)|bj7Dql(v zC|nww{p9FRHN4@WCt|cRT~$x)dgZPOeXcns?`5bsYpO%tj#w`AZw-z_o%Z_B*smr} zbR5L+1vCsK2Oy6uGQjY9bsh3YH4_vqbS|pa@qB8A=*tw(83g;@rhelno(`^xBporG zAQvocNf@kRmXJiYgq_M~b3U!uR!FL)TF0%?@U~)|pcUYD&;re*HK4%_bF(vHXoR}a5J5OxKe=H@yr_4g}HZYk$nns^`{u3UcgdP;VOsZ)+Q#~fR>WS4Pzq;dHtD`yt3jT=&tU@QEJbhkO z3!iHzj2eqbcRINy#W}Rwi&o!1Zt8~v_~)|zwWohOa{m(PMgIlTf47|Im46}qce@0* zN&Xz^fw2E~NdI?;{`-ma6VZWTqJO#LbYeSgy8$R7hn~GpB41Y9 zB+xt)p|xBc?*77v+6`i#Mn(i#^&Or<3LP=-03L{3jUw zhiUxp!RYUX{4ZmV<#%)RCq}cd19AE1iUA%o|8m7pQ&K%QS=Kk#*EiLd7=2Eo4+7o? 
z0U%AxP)dQ?XaKP&i=67gfXQ4zj%9lf0w=KX+_zRr%vKUIfe_H9TcAxy*U*+q`3?$D25O)bXAF_r3vot(J}W~81q$T_6^n!5J8Nw=T0HhL31iIe zG*S2!I8Du$swkRJ2e#$nuN8Ve-{2VRF+k59y3WDQtLE=cjqI`Xzn<@^C4zu!LzNOr zFjRpCy{75Z7JEyB4E_WhSzZKnAf!nsK@cMkug3&ZNS~$>9215Lt{4nf3<@ilA8r`> z8%HJ6t*N{|Cx|j6L?nWc{GByI7Xm^wV7Yi-Jt5_F?ruV6{OS`hWeAX`?^!wjxXc~{ zy#F`Y5dRmj@xvqd1skm2ZS(&VY&<^uvz(U; z-w@-6lld2h2FLG)2GErRu(ATZ4<;tK$8Vrp`QT{vaAhCr`G1YVoY{G`&jRw@s_N2K6=z*hspeFhow-XEeX4 zoZ>Lxomo^rQ(VXcA^DGOplT~#Z!sM(Y4VLAv>`{(5E|iZO+Nf%RHKN)9S6GTL zjc^b)bmg6%bJ0}g{zhMRz{>q8f_zH=sbCL@>A=Lqa|K|cDXf?_q}Zp>3QPOVGU9Uo zx5B{d5jJeKa%F{NbO%fpe#%vN_Z>4OR zAkXV&=qmH)jWzH`#bYsi1CbI(0RQ9eecWh&+`a$gkobMm{^R)o0J{EctU%WvsPFpW z`ZIF^b!-2*>ks%-*B|bIM*0iaA4sD8rRxu1{Z2^zc0T+rnfqy00%^xz2Lubl1E%~d za;Z!`Oc;G;y4}t6Rihyo$ubxY80>)%iG!#^CnAKheK%yHmx3T=HN_1vAV*FTOG5go<^=L|3Fhuuk4dtnW`41qsNx)_%*a=C%4Jnazdbb z;Izos3h~<*2VVtD9wY+9}C zPCK`oQRx%ltq{&XCtL_PUzWHJ1zGd1(Q>wFx%X)?%wmP2EqK=%BWTD1H(^HTbEZ4{ z>0&^x*2Sbf%@+XEx!|`%=!~DHl8F&o^7VvG z6tdrXG)AI!Ju}4@(q;|!ZwfCGDrcB+nizoY6UIR|_}Ya762N!ryMk;4s?)vr?kSsj zdPl>1z>#Q>KQ#r480_cd&gN5mA}`gM_+gH zV2Y8o`Bxt$mNUIp3R0nn=L=v=9|f3ky-7p3+1F2py9C!i15dYaNb-tBQQ#kn!%N9Q zHHT5`cmmIYkO32mju)*g5ia)ZNo73_qCm7c1zw~Wu~m;?9;^aAiBM}NId2?pwYcj5 zz6LRAwJ4Ku&~+}~TY5(Uv@Bi}9#gP4&@N9@;FQI)iU!bWMLpi)*BRleu7PNWRWP8L z5}^<97`?1{YK^A#-g~UR#*QOtckKDvYLi`4CZuJjrC{6glg=mE`Sw)|vaiW`=t{BM zP*TFRMRr9za$FU}kws%9d2jum2!jiQ_yF=@FcJqyFYG(B1h(`qgtjn1cLV~O;HStG zzHqW1z$mPx1NrZ(>Ow>Z#q%pJAN0KYn^$2%fAPUr;+`oc5Bp0Q>@95+>6P18E@`l;m z2B8DAYZIZuiH-u=Qn$yO7uD6zG8LXQxhF%Utud#mbuEUW+!N1QA=bwYC2GzVO>o43 zqPmv}@{Q|J`1`zhtMp!dHu>KF&-^YqR%Sg@*W(_qU}kTr*)-bk-KvtPIv_ z$80qv4Vl|mYOMG}CvJK5NBb{^2Yve)#xpP!7A z%(%Aj$?_5U_t*mc_NI&+J5en&$4sF8MgYO;_JCKZ5glGqnNE|fE0iyE4sx_ZhW3YT zy45l)AhGRZqCD){n3qY8>)#Yqd(~ycn+%s&qOFO4d*_{>bzw%84t*5s?^OzbN5!I4 z@dSYp#mlml8mXuhMXSP6WwQPja~EjcPx*eOrrT4;OO?4CQSe1K_RQqdxch@K@0@14 z@fWt4*e-8r_{zhz^wGFJ8P&dzcxj~?*X}cW;R&A+fo{Li8{Rj1^S&hM4egwglJFgb z2c$Qbl#NKPMyPtWL^dPq$LpFxr@_;vRBspSq0G6@&^iaUhEiOkiyv3azDhBUbV6nc 
z`;2#j#hS99C{d;N%lq*=QTUUaZdS(}U{`1E`U}s5@bmesXsC;rKE2w#usKod+=;AQ zKHKZaDm|`oqmd{d&$s3sLc*^mKU+=@fd-EVMz9;Q{33*-H_#r64ON(5t{tB;J-1QS zeJY_Sg^9szRIiXyTBH-T=w$Gw|Lt2r12I5GQQ#Oa>E<)tA?XHPa_J+z*f05=sTI#g zoed{8ZnusdDu#*=8E8Y;`=(XNNRn0|R~scR%NnLHtIiO+d2M-nR=w7-Pf~BiWC{4k zqOVS5lpN^lr~-C6@NnM%UoA29g`|n(>Hv3@)Y95gc#1uZYNybgMKB@~0YEPDobu-3 zQ8)is8jCGc!szm}dzR9gQ_RwKs0e!A;-e5zVI}a*)=t|p%2(VDSWHE)k;N0y3`O@v zFoW%x3>vLx6c0%;i8d;7u@WO|ck5ld@uuo-lh%G^^v5h_d=)4}qjz5>bcw4Xbn zNBq`w2USewvC?)WUbTPRO!8$;OHt9oPoYR4?Un9~&{H1Er^P!l z3pJNA(pIp;RL%>>c0JCxARm$iGQzWp0gwqukkn=)>!pgM7*&ziTs2?6F?J{CHyoF= z+SUh5`ClX3Hu}k!C9QV7qd#%%=nF#Wmh=Qj#`%5v~2kW*+*P_?zm+%9X4aL&}7hcUDZ)y76=Esg&5 z&W>m?)Es*p-MJK=8PTyC4(Oeas!pGxwl!OnKPbx++NCq@I(%Eq{{h1E6W;< zqW+WSVndSV8JO^TahGL>L{F4+y z-;+!(Dk9&Q4yeV8^(DjEtd3W!)HF&y`SCD2+^?F-2tPOGMU&+Aa)9sc!!JZm7)~$4 ziqR$eo|`{kjp{|1+eU438iYUL9qc+rBz?a?@Ij(JYq5uAS*B;ENB^y*QWEn|*MKKfxe~B&1wMg~y4asg{zJ_9-GVG~s1AdlX+>gxJ6qUswKi z&}sP4pt)e{mRnkcPjz)eS_5n9;WtCA7_y5FWf?DJJgxWSiuF2 zG4KJt#B@R08g2=aBecf8&@9)W@_Py1dgnS3S;Rr!<&bo}J+e!olAgd^+)U!JY}OB}-ekmebuVQVrph z;UK8lwmj=A1fb9AE<>?p-9fFF4h!W5g*mW#UV_WwTNrVJZRxTJFgSmf(IkBG9?>rg zuu|~3~)VYxl4wmKY!?|H;C=sDRA+YMJ4Jqkt$IpY%tnBeYKU6 zc1ECYwnXjNlSlf5hJAfFeU;zLc#n`^JHBuA1}UHi>1nQMndT>vzTWQ4N>kNj8Z&mvZzum-=DmFweX7rg%x|tFBsIt^K((&x)Vx z#alIYyIc6$@kG zKrDJqOM~RowCaxCd7oQ`Bq+Il_M_4iYdLoJPNv*npHeK%e|Pd)`egtDFKqqCS##DG zbygGB2ixbnhijLq!;6DgF8J0b>hRSCV1%@Ce#xvVOHu1t0rIeH%12xhMTCI)6+`8~ z%YiMbL@ol@D9Y-&No_?Jsg=h8H0f`0Y2b*`Y;I zQRko3^cxp!O8Ps*r$-8(fV)qMk)@cV9aK=Aqg_rSBc@?`B`9Hi99BPENiYvJWg>p2 z?N*Ox9`y#vp%{$IYo!FUt)Uh`RlDn%<8`pQ9a7!i8ewwWzQsc_wrk1EY$=y-HqB(J zUdJ2E!&G~h;a}#kG@UfPyHwAj#BclxMHb2V`PCuV*MwzSPzh=N^#m%|+WJg;qSNuO z>75!gLhQVeda3-1jwECkvlCGhQ>?7@L)*}0)N-sVC)U~3XR^g zuf}=y#dxp=NN*(<&OO3>WwK0S@kBYB>h)Ybg-UDg%HO}_)#*o~RgZPIEm{}J{1O`| zpJF}LQF|)dMwEz10%W4CDvon}z88bGskGcrR3rIB)8t<+5R zm6o^iE#7+7?u!uw%)NYEYdgU?igXKN>DpU%oO{fUQ0PWEZ2~?X+Wo2R52oGtrP9cu zI$%SuS)0@`WKSFsn|XqjVnvFw=EugL3qu}?v3~I)f&~dH8oUCSk(1X}P+VeMPG2U8 
zym)DsnMFT67JlR$ZR(0AyTLhnrW>2t8yTPs`a&SV3+ybi?tSDm9W!fRzvNf;`?`iI z{0ko!l2=N2lm64Rkp|U~X3IXBvugHY2yLC}h)`4Jb@qk&D&tE9ajh{!;z)S5K09QT z9h{ekk>ql?&@zi-_`E6$g4a0TqPA#=-mkRNQ}`(1oRQ}dR8ZfIJl({@TqQiHI`ZLx z9+mndh4T3R+aD>Ee}@2-{%ZpCyH?@1NGp&l{HL0l-wDtMGVus=rPfzcASUqTorAangMxv>9_f~+WnT?NF0Ff0+U zU%1ICQ)Cdp5Uvpbj5r`wbE9v(aJqE~A7iYkk1^JFSdTH**@|he_Np&^KBJ4W|IRVo(cZ3iVeGXevie+@>k{iAyJG2tgR{$$MmhY|liMtJ_hh*rige_}`cz5#zNtbfOjko!xF zVEdt5{)G{2KSHbj7mRp(_V3pKFo5}Qgh~E?rr3HM^`GPPYbyT6-R z?1;a#BY@WRf9PxcK7;=~81db0{S71j=!^ajBmNVN0Q-dzD~x-8f)T%Oz+Z>Yzk?A< ze~l5}UB=(|2;gf>|BR2|1j;e~=4<>EQvo$ja1T1C2NBhS@aa(i^;4MjAclGnP5pNL zkoU*CK>#+OKK0?52SL`OP6}Ao!}srkte?Msim`rP|6b2yTYws?hdhtve6RPrAPZO@ z2M6ch^jAN{Uk~|ymibWD!}UWsk7WSc^(f%_S=jGyc0CT#PrUvChO8VM|Js+77W zKz^3QomM%;KTEJR)}Imv>(ut%O8#DFf$`8y=w-TDIe2sEUcorxIP&&gd1{{|G>#95 z5c8i2y@8AOKRkJU$A$l%ljl1b^4rPtyMy)9gJtCazG3k5T>{{Hng52EvP|7WTdBVx zv+0rylpRgT|A`O`2}OAzilBl+CuUF%OekC<2&gs)Dx`?3yvRg2H}Dcq0V^?-Fj3P+ z)Kbf;>}}nrLY14T>2KC$D@cbj9n)hso{K*8SHn7{qi;B!k`6hY;?bsQgt5$@8KP>U z5mv#a_zbf&Vv6^cX)T_Qj*3d);kMEkTFh|o_%@5PI6Jo*xixZ&_LJv3w?lW|H8dtl zzw1vST=w$v6p>>0aVA_OP!;K+-_1$R%stc@yN^*rXz9z*Ct)}>j>U-*s6FhC;e)rU z<%L5xp{d8anKtwd^2l2?J4d-5-}KS?)|M25_ACO#z3$eGZWul4SsT>Oso`9YzGihzTfcNFZ`NO`?dsM!r0K%0?hBSC;0fy zQOCiyxyRWwW^)t$vt)Hq0;M2l#vRQ~jEFs^S2+3g}Tr%Jg z!g_5SS;Ibuac0iiaDch$Ucmk&^O`>BC0|=GTvj{C!CVJ*=7nII%pt&uE;0&QliYnQ z*o@j7ZkG9AT8N%RI zeiG4B<|w7leVD!?i#C0Ed4_b^om)wd)DvM3%NKmebcaiT-i`BGRkaVQ=EVi3SudnL z=V#pVsF%a7tL}aQ9l|Jd98_PQ55Rov3hS~rXg#VLkYbTdAo)bg6#hwMOys`GzUs9} zoS}k^!^q^2qE*v4(}xq5(qZ|2H2DT2LC3sU3X)O#PJnv|_{$d_K4^C#c!@;Ec3xP~fI)rr`0hsMpph%%{CB+L!i6VBWf4ejFG>U1T z=58o2D+m&P=N!80)eT}^Qm?Ub?|%?#Dl#O=0|{B#6K&z{ zrh6LNiS1yrki=lWkZ$#+gYpuYg$b2_h|3RPph3ldbhkoWlu~W)nJ@X$8eR%=jd${F zRLlm2Py(C``dJAtkF(jQ+pT7g7nL?PCF~`)*P=8aQiPt-6U@gLEBTs1!y&klW2|hG ztWoxZ&`%kJJaFba`f*q)PO)W)8zHVM18@zkENPlpts(aERV90;G|{T0wkgm?BU)W3 zn&>QYzNuch?MkH;ANbsbf5BJPnBiQHF2u*Y#}|Pdt5f0AJ)wS2MTR`> z4Kr%zuOi?bq)3brUJGaoW@P>PIkkEBIFy9WMLYFwuxJJwd|y3~19e=;nEO2j6@h#z 
zfyTD^qDwNfUuPh(>{PaqbvpHBTluH>)H*_T$t5qR-7$-PO{%W{r@bqWhjRP>vSukn z_N{E8nAOZ6OR^=|MJT&$V^{VqWV^0COXSLyvLyQ!$`X+!O9>HSq_Gtdzh~U~S>|@T z_x|^Ny?S2$nUBvo&pGFF=Kaib&ilMSq|8*3rilix946Zq2ieyVO7?mBn@c5 zVYA0`mjw7j;0DHBV(#S%jU2995yWC}A^KGCo{gvbP z=^8xK)YbGvN0tfZGH+!(WeJtfjw?FS5x0oy@MphRC`xncve%8%z2J$10+vV%eoo4V zn&AS(3VA+~3GBoP0o27t(mW-!-3}?&WXkx3^BdG)^{rV|w<_DyZ^vpnk?Zr=t#vU< zjGP#Gw07?4NEK>gvg0Gaq7&2FQi=_)tCaf8hS4GB>6XJqWX80s2v>rPR-y|1KzTp$ zn47Lxz82A~S=H9$+4aM>IauF8MPe4LezP7M@VUKob7R22(K48<27O^+?ncwmJd91n zW63YkKHM#}RrNfn8!I2O6Irq6Q-kwGE|$7T=<3or71&<0J5{Pld{s#pDero=mO18z z+TG=%vVOUuNV#{D=c)7gL@`lk=4&&q@)z@EJAbBHC8;Y*H*w^e%ix@AU~p?jxV~_?$VYp)H$3t3ljisK$9zX>l{Zg@c7#WEo-*Lj9)o-98(G93 z6PjQWH^QR7W_$_>N%+Vjv3yR(nZI$hr@#F=+U3Hhao#l)`)!}sQ*3W;X=}=@oU>Cf zXF0p_%DmPwqTf*ED8)76aXsaZ;+)3$J4@&a1M}Z=7j(_3R;pt#w8*iq97daAK zcLlQ-RT?)Ssx4pa0!d$61s-vVqLY118RN4qI1gJVImMp0_EkYcaYQA#+U0az z1bYDL0iIkDEY4B?WF$6U1O%amazr=9+arZ+7bYxX|KbIaZDL6IybF<57DA`>Iq_Z54;@lCALr9{TK zFShFx;M)>+OKEOu()I}|lB1N{1aJ z_FtYgWXt|>In&fzN)JIXe%1ZGNPtE_$sKcZq-1#1zTC7HY-BEWKOH zw!r2?UTB|~>kBK8{Y)ND{_Y4XsoM_F@ARJueF#teoVm#IU= zi3$iSs6knpd6ngBQPJ$u_EQ1+J8tj7|ux=4dap44f9JROXmC@Hw-I;iP!c z3-~D1wSkwd--l5Iewk0<6>D45DQ<@OmNeIjmcvdV1mLxEr?WYyk5kbsUc_oi+vcHZ zh=hkq1qkTFM(H0vpDbWzwVU)4h9%@y+$$=nW%P}%d&u`f%9bjW@@D8O>{Rcw`Xyp{ zA=#BYr>^c*G`4xgJ)3BtmD3Z;z7Te5iJ$L6+L5eBkq$4lh4tOttvVt*BvZnnj2+!q zO$?i1e1(-sx0cz=2=$QM(>3{3Ari##$#H{@DXj!D_JMD|a!P9+rky$n6CGn<2|Xk$ z^LuxIK($8L4N7~xbLU6B!|?BkXVJ-)OR?m8>0(AN%j3a9G>U-`klbRAXu4d`L%f zFxrESiO5AHvAZNlgCwbf`VJ|{Auf{^VY{9IQB{@H>`1$)#$*8!X3X4$6~hHrW0}bM z!JhtN_d9m);^U3c3*hdfg4J{GbuG@8mYroYz18)x&QV93YGidj8`io{-MGozr0D~l z^oVMvHXsc%+>k#{c?Hzsc`6=4efO@qNr6^ISfAo!=CLQ-!OzY=e>JF~)R822Glbnn z5y`Gs?Lp}7oPkPkOurc8HQt*1j*gIf%MFwIV1Ta8X7TF*0}scMsj~jU!ZQEO4e|71 z5g9VI!~<)WOM4UX z;R#|3h)p}$lV)!>%zDB}!EzA?pMy@C8wwPsiZ&6kjeH%bVtog<<%dGdsU9k!j+K<(4b9lT+?zqUN7~JgHhVj?lyJCKDu=DQ@!c7izSH=~n$1jq$nOI< z+|!tGk$X*lPR%BwS$v64jx$S=FX)5qjloA*t(kq+)8zEF1lMIU)#cj9?#LdVR&Ii)|r`5hlnd6i>B`4#-gDS`}Qs0a5a4 
z3RZ68w-mh}x-tr3u>EAQ-q^Ih_Vm2FkF}Abm~WbODFIW%paPuALGbF;t7AU3?!7z{ zTG?*87Ck**%S#u)$+ezhm zb#uY<1_Vx&+IMWS^w3Gn5w>~DhK(tu*agvNZ+@GNy*6ZDaal*Rkczc@wl{bzs{~Su zeAzh^&K($8YIz;*%!Mu<=xo{$&KBl<>T6H#TdpsqYRvBbS?rCYrNx_4MW3h1WN#-o z$2ktkn&j|uOlVbG=iFa(R(?;HRC9De^0hq{#;Hm=sz#JPf>j~ubbDkFlIcHGDOKRX z&ndjI!I6-&S+#b|eMl=Y{o=*+lq)TY#Z|~8iSgGy&l>|BKC(VQ)UDO5h`scd41G)g zc;J;6slTZ*<4M_Mf{%gJv}*EObm40uwPgy7UJtG;f#5C}hqmr_aT6^uM^s~(8iG^~ z9FV(REfi|CbW&vG2?x5m$D?}X3;XP`%AyW!U75wg?CKh-&Y76uFh}qN<4G2T7XT(Gm zmI$6#QSKJ|SW#v9Ti5h4UjG@!GrTz-yv=uKCKqXmoP_5o!hi!DOTwQxm{N*tAC^BX zEP1`$F{C+AIaqx`IdXPgX^>xRQd-=Knm_q>#it=aI4)YLD{@_G5~cNPQ39oq$O@yhAaE0obMh;IaI7@Zj^Do?-T z6f!&-$)jc&IMF1Z`|R{IR9y1Ir$3&d#?ua+<`p6&3~PbS+23N2Qb=jFeUB-RNM@m_ zDD_*ofhF+Kq2a72k^)aNn;tsRB%6~d(|9jKxbv7NhxTDkWMAdxVS$De&PSfn?;|Mr zd{x;|>ZZqQR|~NbG!JQ3Pbd<2X=@&%(YTi#bIX~mUr|6dkCgH)HrJ)2h#BfK2V3>v zomLvx=co{SRlPYZJ?@n%pG{I3n`>Ph-s`6;B;FY`TEYiCl@tp1Ru>2J&`1`l>pjTm zPXAqJTymLldN!7Zj`ls`voAr&{mL`=EBD({ao<;q|J};H_77L?5#OX1c6Rp=-#7vP zWp{78?Dr=A+THu%?%wvOe;=ovQ2q@AV0Q%nXVzE&K?U3<(9Rm`w&8ofzpt_WVhR0o zme861-`&$C0b~CHBX(0Pb|%(8+1KDT@3jeqay;1iWnhv_4>{aPj!L9|Q^qz8F9ELm+T^ z;y?F8B2fRx4~4`XjoMou6oFF@-)lp{5O{vu9((;DAQbM@+}^T(Z8&k|{Wch&oW9=%N8n@$6Y(~9tFW=-`i_LA_0cnJ{t;# z*gvLVkSJc8!5}D(`L?G%V5IP31O_8;0@Qo`z%bbUaRCFm>HGT&1`^QY*&sMp-oE;v zaQyZm!KnRX3PvDNcx^_)a2o4-+lPYS%|9>_h4;M#qX6vNR~7<-;>8ls$pTUa?7asB zP|U_F3x?ue%J#Gc0tVs52m%(x8v_Uwgf};Ua~rtDoW1oyftkPGhJ=Ig+6T-@y!Igx zxFf22>qEiu<^co+1PRX%i5E*K2&W>xw?3d7c>REaka*t(C>Vmv>F~WRE>3{nzqu0y zF7KtLm4`X-OaT%O*xS2sZ6`a>0JO=`7WP1*gg+UrN@oFCdI^-dg&EWYghYZ(QD9>j v0s(?qK%sCGQv?zYH5WCJr1-hZpNDXrU4RGX_R|ashN8d}yu5O%@)Z9E4yjLZ literal 0 HcmV?d00001 From 535568475541db0e3d7b359b5b8be1a45e72add9 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Mon, 25 Nov 2024 17:34:42 +0530 Subject: [PATCH 17/38] Delete examples/notebooks/PII/Invoice.pdf Signed-off-by: Pooja Holkar --- examples/notebooks/PII/Invoice.pdf | Bin 33150 -> 0 bytes 1 file changed, 0 insertions(+), 
0 deletions(-) delete mode 100644 examples/notebooks/PII/Invoice.pdf diff --git a/examples/notebooks/PII/Invoice.pdf b/examples/notebooks/PII/Invoice.pdf deleted file mode 100644 index 7b372f7f291e713de14d5b63ffab811c7935b43f..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 33150 zcmeFZby!u~`UXmOHwdyoy1Tney1Tm@rMp8)K)R&6q&uWVLZnN&K}zld_2@Z!?-RfK z+&|89k3O!o=9puS9`E?R_j{+L@*-lijC9O!q&*ua8@q+4S>ru@aLfP(fSsWw91jnG zUd+PU*~Ag}ZEfIeB4T1>XKVtXmoc$5b2bMsGcz&(`1s(QoE=RJY~b8M>$GfQKG&mu zE7v0`l&#)!HΜ*tR6+vNmdelII+ua|1g z6G|(Fk-K>8BXu=OE-&@k8(a4+m6pCK-sdjzUhNeh>?|DK-pk$H%kf<%)0gk~OgHdP z+u5~J?VhR{EE+S-H=8oCsHKn8-JBG+-Hnl1)SniwF>#lMV6hB?N7M+OFT_Xm%pH3= zD2>~Of0}DDb2u>G4_?;NmSJZ<1)0})_UU;_=CV#B91=YHkw&A(3zeWJ9Ha3aF5lqd zLkRiy__S>t$h+PO;fSVRXIov5&MFoXkj;M?BZ8oz+Zppl@xh~cV|WY^Qa-S|8O>`S z%(Lq+t}|7!+Y1@&w1IIam7H2+a+f0wPtSWJb$hPOb8QgvW;ne$8Twkf`@oy1ohSJT zQk>;XYScj?721N`!qjk00Bp`T75f>eOD6?9A+EKY19pOxTp>~+a~s$80qT)mE;{Qc zWhEBbYzb4_JtwWR<4BH$JxJ4Muw>e+L$VV6`UgAu$28N-!v%utb9h+Dz40?OY30R6 zgME`QpqiCSq_)ocvoaqzAe)R`cH0I->nm#lDmi^S-)@CDZ*ehVXuXc9V3h3h}qo2==E~>^9Z-0^B z(++J9T4vtuXCXe?94ssK5*ERxo*A{U>*Je2G05>PBGnNV+G%)25eOwyMtn;r{Vi^J zEpjI1^Ojm-NTeMEarp*R+h~Vz5@CW+W|peb6$h$nfvcrVOhTYb_;nSgnrdpf*G!>~ zc3o8j`emZdN}KQ*)G{;eCGL}JPprntzWp0N)x9TC=+_(*_W37T6iI^2V0-f>?HiX$ z*;XGiF|ciWTG3D;9a}bi$E!E@i0O9IhfH$9FVqZ%PBmLQb1v;oi!Xc8-h^?pS@LmXMPas(@lYF=+v&mPC9>q#L2~i)I;nQ(Vi3MuRMJ%C2vq6qQ5)fmqAxKhpVq=d?QhKAgOS0h!*cbmJ#SE@x^6C75%bdYb|%* z{5|;&I*BoE4Ei)a1SSf`^ElIeR=rpoAAVMsmcMSNOI#Rj;{>Ne>s&|6mPCz4KMyvK zYL*euS)Vm^{*fA^Fy4hgVkYiFKELO^GeK6#Hk7rD^dNuhJ@xfWd}8lrd*{k>afv)) z!;scCtr_+CmxO0R$G71EAL{)*%_OvURiE#Y9LaIIVi2f>S9Vf9&yb=BK4dAXgQ2e7 zj{25&<^5~}(fi25TXIdam+LKhR`eSMHtl)V=_rZUj#XMzIqDTW`GFk@rI!*TueM)= zrOY;ed!9wnCAxPg9YT>vJX4)K(Q45{W}=B@wQX6FE*8Wkf?qao1PL0-eNYT$)o0sN z-`=eOH%A1WMq7^NXH~vqis~L-BC@Kvb2fWIcYGvZ6d`;e97XGv{N%~Jz3P#^k=9{S z*;;mA5^~p5b87`;mcD5AdX-2SL`oaCS_e%_@eWMMTHpN3Jxm7NLyn`jc}yG^k%$;0%2bolaRY_kGN@y1n|&j^Mi1vKV|d7 
zmSK?x;UVVr?oz$NDD7jPlRjLpeZA;}6Hz{lF$9%q$HlqZRpO+57WDDy;rO{ZRm*zW zB)k7vbz9mFl2;n;^C+c-`@52FFYlK2?T+S7W&`SJ2~L+j9<#b1n!XWK8nf$D6{ZIZ zmGici8t`@(mWkq|d%JkEXh(Hff6*@txT<`gG1+Du(?2Hrc1xkc;7w}M=(XAuv1hVv zRH@M>=j;A62052v51jl~4}4;7*0IQp33V<>a^A*;PRS&ns!;DO&Rzh5e@8C-v=p`- zEDlti1%D4wj$&LH*GZFTZD8l^bGaKo7o{(h)G0aQV%HMK1@2Q4jMH4!IY(kwWQzgpm7kpTWiwr$(3Rmn0|jK(`= zTcF0C^z~FGh-z*jTn-?9SVqx+{PNli;Y_8#Z?bYFRxc1c4B=@FLcUqPM9iD5r+(H- zyYJ;m*U=>X$^8sVA%wWJLZdZ$gbZV2D?5S$oG5M_2)I;HWyXwNWwVs_Emvl`=1hy< z${GzAKh5W-3OM&gZy1PGzc6IQ^LT|a=g!WrAj42ZNL*`yNmTG!O*pX0ga#o)JrLDx zQx78lItg+>9Rj8`k|=lRv+zo6nr2^D*E-WMk6_IUJ33mzzOH8(v5bSu^S~1gv(>>`7J&Ogc#LLAb`<7^Vc}-wS*lRBKVih5^l4?oMtkf>4H) zJBnR~#Kvcv2pX;IjQNnwOhv<|=F3^sA!K5aqOn2X3ODJ<6ZqL84U;b+^5Z-bF9ZIe z#7rPKPdkWv7ZL;Xl#hqX%F~xruOW83?>WXq3H{2DGmO&QJRK0U6hK3A*6{HzSwm#XDn}1R!9)%WGs$DG=^$E?BNpe!~35{Vr;b<5YKyLY@ z9dl}DXKSk)lI3~2Nz;!DJ)3$*eZdr;6fL|DLmiY)_n88NI^8Ugf!;Ze0A_hZsKhAq z5(_n4PVl8Cs>Geif#Us%&&p(JXOU3=I@KFr|0!-r$PXr2zKt+hSrGAv=}>i%z9_de zD9aX!(PEg13RfC9Ued3QhGuKvt41TBg+N-uX~SV8mW6>+FeG8#f(GhOr6T|)nL(Z{ zXB=4YY`BCroUqFf)nDCTu_RDpHIYIvQec#NLxb3Zm}oicOC>~^@!rB~nQSIByo%sj zbi*;@Jr6K1

_Umr$`JaqGCAAXzMm_oVwlNS!*QbkVfb1(%=qY)t@$1mmAm>^yna zlT7ufkU#9cHa0Fuydb8F8!4di(+(jas>)#(G_9M>qkOz6@zM{mj=%1;>P3!R6_Odp7mkkZ`mB9poWqScKGJr&jW*mY@qS8tMu(5IAqecZazr0fEDevw<-~6t5EGd2Tm#Zw zl4bT7eQ(-gWXRL02qxYzg4LCvG)pj??C_8-F_VjI2lAmjv%slt^JCSpAk%!Olysqx ziuqBg2pv{qS?aYBYi9^rQfUkd`6sg&;&!G${G(B!1S*#T&<_yk_NkR1xQwkk!bStn zZX|W%`X-+bg01g~y`fk&FAmh&r6Dj;kh8IAI3b_s@>63j-MhA@9r>-VG2%6id)8~FIFJ%ZOB%+wqi^>HU8$RJCbq^u93SAX z2X~3_H@8XI!yf1-$s3qGema`iIs=%1w-o{OiY88WE{;YfP5_qg3Bq=^&cORlfCsk= znBlpJv4w$Spb}mIe$t79DOc50)czwfyO!$AQ>QU~o=F*St{|0)7D zb=%6DlJZufujb-_&eynko~B)%%xU`GCytn1m;?$0PV^%ik2e|i23UepO)oCfJ;cyi zA|+EWBNJ9OY)&chT2F=bMd2oO^Wf`bvh{2{jZxedmW;?|`NdSymE1^(Z0*;i%VxQr z&?c?QlBpmYmlPj9e0?JW!hn&YH@robZ5_rWih+9&b^-07ZtE=V-f2x{6ARX}QOpaeEn zO63j#ASaL@wTo{~p9Wj&zUo;;Acv3G+p~58&9tJU{=$Bf>2EjG9?W@}^8VmPmAe3& z-|v|uCbNm?MnPeLw8-!<0#fpr@!ciL{aOd7#mHbc&6~^)?Bk#>Dhgi`U<1R%t8R7T99uR1Wu4~7SuJs(jH{W@5MNTtUsGQls1G~2ckVZ zg0Hv$TyUo}1L&fk1&sd@3TBRAaV%CIa*EJ2$`ef(4`D%h4h-;K;jmbgR{~e#WN&e( z0*bSR<$2GZIpe*9(FwTDf*uF3K;40Q1){zN`(S`L3sUbpSILbJBi{9XS;QUl4K#Mg z{EAc~79UbW7xv1NBM_P(y{>o|fdLrs2{Cn;C&wZ}vFaqCs6E zBF3RVa;MXgp(jxroDaUw)1?@~zIO6hSWO5ke`KO00Esn;G728d7R+5BY8Q)w-&;9n z;$Rey9&JO)s_0q47D;?E*?0`eU=nOa63UcXKV?ZvVjVJGvR0z_K`vu)M;r|SUAc0S zOn@FK@+*G%1PU??IVpTmp*dMjsU`sr0T1DPMZ~wZQ|iU9-=%BHwTgI2dC`l+bY-+UK=gpG-b65Qwwq-F%4O z?mZ*?(s~AKLTaM5L9v0qfwxgJqzKi;p#POw5jQ_FKC(G7=9Fc#X>4t5o~^<{_r8TP z@rufp$}^ET(KHb>F{W5X#qJ}<$Mj;GV!0^>OEOCX%Q?%Nk8~wbV})aG8Dr^sEcffT zAv=}t!d^69Tb_6C@b1LprQ)^W;o_;_CA0Lg;-=cB4yJCTZn5;&nrp(>L1=DhZq?~B zB&Hk;E1QzmL~v%8mG2kQPVnU|l<<}ry{%C<%ayGeMb~*nr^PPVpxvP4CL||KuTHO5 zp>UYZr;u5enb$7fuH`e-F%I~oA0e35`8GQ|yWJqeAa}*+S#ey+1XZqPE_WY?QN2U> zG2faNS!GzdqZ^f+=2n*1es z)d$JKrF4x^%2CA8TzCZRBkU?xJ`#5K*`>*41!QCqG( zHa!coymNPrH`K_|$Z_Hk{RxHfTXUN4-^9Bp0C+{i(2K$OCMfy;vsgHVO$gJ*(8gLHxD?gZ^@_9vvD)FaWJ z!U7Yz6?P5cfn`O|!Mt|Dp`&3bCEN38RB`uqcY~lc6crKfX_kl?m>dv~h{BS_auXXA z?-l(h{#^8>XtJnuvO0}M&AM2mW+b417mM%-=1US+$D+&01?iBry;Zoi13G5yq?)}? 
zj&&PF@5jnx^*S{isGdF{Bv?YZ4@2)N+K?$f(`$0ufS8GI4!L=?)knP*b*%H% zIj}l#LL|9IC(KB)ToO}q4NOk3ZrP<;@5FDtC?Pf>my)??qG-e-u=Z>12O&~}@JebO zW?u4E`fiq72H|%@79S5(&~pSP;$$eCG)$XcM}Lz!mxv@uBMs1iud%FNG!r+;98wrY z-8@@QUAKKtFnBt6gHnB=K3x8wv+Y13aKn;x?Fp{N!CrL^WDmM+ir{vz(!*3 z=)KtcD3d2kw&9CY3Rw!3ZTjXlhnsyuO+pGQt+bYUDEJUyIL5&2;6VYzsSFFTXuzU`@FFI?o>U+P(ycaynI=wY-9K&=cJ)o zZLdw$K+8cN*&@a`rk3e#b*of`rTE9foxB}CyqMI@)}5CP`{|$h=V#iNY&qrim;y7+hJV-phzbnS&Blpw{`&;$| zwyvmGeN}4?{dRZD=fW@1OSa{ibDO?Ga1e1Gul9c>Jj_7gnezzuD&8i)%e<%`*9_Op zp8h-?R-eaa@lxjH(zfr1bb?U=TK=?q&$Gyjl+>CbGwHFoQOq}KoSAOkCqjKe#}Q%J zAzW(CeD7-4C)3L-%NO7KC0FPkH#ru~x8B^UB$t?#bLwz=rF*sR2j4|aKW)xP(%bX4 z*seRYY2CK3xl39ri*AGT+Pc=cpUPUgvs5~67;zUWNx7$stnc-m1k=4X*3cu9i+cwVY z=?j}J<1gzU+$;_Y3d0vm+W0U0?y_%;&idoF_vF6HC1ytPyLrOip170Qm~Jjz2AQM)6-FkOzmKStsn5Uq%!1-IqZUfS z;QDrAhmKo7->usQkPI0DGRI@n39gb&wRf*Jq1fOV*UEjNZ26R>+N9~`Z2JSP&$8Mr zI6K+VGG2M)n&!#f+sbRP~cps=K}X-AJ^8WabDms+;43Ffjeyoy9OEg3ky0(l9g=$cqCyse!mzDD zgMi*L%%kt>dYkVDzM>*lz(Kuo_{KBguNHzW0e{9=Af3}Y$R=1_6JZ-DCKu5}6j3Yh zIF9S;KT7kVi;7ya1jMAF10E$Ij~U({41y>lFgXm*!{!y4DsKj#q!oXWU^kUw1$3iv zj|fO)8|(1=eXK|qeONIGp{tx2B@vMzL=KcRqkW*UZWJ~v+#m$7?}%3 zcd(9^#Yu7C-s0yXPKcaEX202b!ibJROvj6i`fx%airB^@f%f-xj_~wyn0j8a&3+-9OT)rC!Lz-;a_OCy_D= zUrvj6?1d)Pgw^SNx;*f?K=@PYIBG+(keTpk8*O$jVvIz(?w2#~G0>D%$vBcqb3mLl z^YlzG-xtYn7$<-BjHX~kVZrc1#YPF~9ub{w;tApinvWGZCle8;InEfeTO*8#1(8`^ zgM{zIO(z(ZI}L5OA`8h+YuWiYqr4bsiE2)y8n2@_x-WGS?eO`vB;pYS%`F6yU50lY zjG&7y?c3*dW*FV&p(nXJzwlck^=j?8qFTP@Z|pdpa6v4%2zpJxXhzX9d zJ;KBUjG%vSG%y^D4g%l|LOZr+*}n9*7WjAa(RW|F}HTQ1sFk z#!f(T=zFUkM;>@q{l+CR|9)J3pA2CGX9H_Hv+tA^kbe4y;d^Y`&*2jUF#nhwAv;H7 zpm;zFnET-`unpf!dz=LTy^@Qe^CM3tV_|Fcz>FCIIV&bsHacb|04w7I;RPHM209i1 zD-$yvko@{Z6g^h(v%Vi?8aZ3o*#Zj%(5om)e*XX-rhh-pj|mUYQP>y&g$;IGPA>NL zc8<;h2KM&WCUi!2Hk5xOoF0bfXBC0~mLGHYf9Vn%D;+BnfQ6a!p+}6IoOEm)02Vfm z-#heJ{Lk9{Nr(P%;8`9Le(#W_ow+TYv7L#4iMxT#KlbU5o9vPG`!x~DjxHwOYkx3g zzTXsl_#;d$=xAYJO)F$)ZTwvMJ6kAX;$-A#Vef3`2*>hs9owncT0Ck*fSl#yb`-O4 zbaECpH*kDd>+~`PKQA}|^lBEy&gKsbk%^rHzz*CEY;5ds?7x0LrvJV!AItnXNIx_v 
zOl-iC{>-HMo#)eu8?o$WKnXm25zbo>qCa<)tB*iRx-9JTb{@JizYcY(I(8#qsqGo* z^Xw2#&t^Oe*TnSQtxu@}bIqB`-F`2O>b5*ITMT#H8C2IHEpoe}6rV?xq2msZa6vz{ znwL{f@~61*H9_*uABoT4@(!(Iub6wnJ-)g7IZvvF<8tKLynv!Irp1B7g7mKLd>p0p zv`yIryTL-&x!bCfi;M(>xOY?@*+`8Q=FKz`jTB|ciDMCj)?J0mJG^(_T&o94n52g8 zj7W!aK!gOM8^MJXpGKoz3cxJzt%AaUx}nAhz_^FHg9#~OF_B(*F$aggZ$>V`I%G8a zWUncRKce2|ctqN9cW=yPu=_eexeB{@3h&pKVzK8z9sA=ZnGG!0ds} ze&mpafYddd7A+ePk7=3Nfut`31CStQU}XdDen}vRSQrV~npv9wR|mbIlhFe)&dI>~ zK;eGBpk;a>ZiNl(B}^>L%$?!bfxI!jlCz18>cbNS?!UV_i6Jmev^2daqo&Ub17y%p{z`p!gUu?h~_Al30 znc{*ph5`n^nd!k^co4U(rNxA`7FnpsL4A(;CiI2kL_jE(wTz{0?d~U2Lp}+4LBH`2 zeQ3&x5U3CuuNpC7cro(iuy~Ct0TkiLLP2b!i~FA~>U^@8oyPikx7-e{7E;<~6Y1=` zSYtuCh{fn<{lSf;n!DA;%+?q|WCGtsjVz0u%SQ$vd;JHG0M` zzLqy+ed$A$XDT}j`)WZM3>zP$s)49Wt+3#d|24lGTg64P zQ^cqBcb}k?Pg{iN>je@} zC-mml-~`>0wYFwjCRmCw;u9E^b>S>HlDu!6s)eCPjTtCL2O~_EXd_aEK{yqu1AcBu zF+Ri;&nG@b347^e^0HinX+brzUR*I#Kgc-*45bd`-jaY#uozus^&Tnd)|bj7Dql(v zC|nww{p9FRHN4@WCt|cRT~$x)dgZPOeXcns?`5bsYpO%tj#w`AZw-z_o%Z_B*smr} zbR5L+1vCsK2Oy6uGQjY9bsh3YH4_vqbS|pa@qB8A=*tw(83g;@rhelno(`^xBporG zAQvocNf@kRmXJiYgq_M~b3U!uR!FL)TF0%?@U~)|pcUYD&;re*HK4%_bF(vHXoR}a5J5OxKe=H@yr_4g}HZYk$nns^`{u3UcgdP;VOsZ)+Q#~fR>WS4Pzq;dHtD`yt3jT=&tU@QEJbhkO z3!iHzj2eqbcRINy#W}Rwi&o!1Zt8~v_~)|zwWohOa{m(PMgIlTf47|Im46}qce@0* zN&Xz^fw2E~NdI?;{`-ma6VZWTqJO#LbYeSgy8$R7hn~GpB41Y9 zB+xt)p|xBc?*77v+6`i#Mn(i#^&Or<3LP=-03L{3jUw zhiUxp!RYUX{4ZmV<#%)RCq}cd19AE1iUA%o|8m7pQ&K%QS=Kk#*EiLd7=2Eo4+7o? 
z0U%AxP)dQ?XaKP&i=67gfXQ4zj%9lf0w=KX+_zRr%vKUIfe_H9TcAxy*U*+q`3?$D25O)bXAF_r3vot(J}W~81q$T_6^n!5J8Nw=T0HhL31iIe zG*S2!I8Du$swkRJ2e#$nuN8Ve-{2VRF+k59y3WDQtLE=cjqI`Xzn<@^C4zu!LzNOr zFjRpCy{75Z7JEyB4E_WhSzZKnAf!nsK@cMkug3&ZNS~$>9215Lt{4nf3<@ilA8r`> z8%HJ6t*N{|Cx|j6L?nWc{GByI7Xm^wV7Yi-Jt5_F?ruV6{OS`hWeAX`?^!wjxXc~{ zy#F`Y5dRmj@xvqd1skm2ZS(&VY&<^uvz(U; z-w@-6lld2h2FLG)2GErRu(ATZ4<;tK$8Vrp`QT{vaAhCr`G1YVoY{G`&jRw@s_N2K6=z*hspeFhow-XEeX4 zoZ>Lxomo^rQ(VXcA^DGOplT~#Z!sM(Y4VLAv>`{(5E|iZO+Nf%RHKN)9S6GTL zjc^b)bmg6%bJ0}g{zhMRz{>q8f_zH=sbCL@>A=Lqa|K|cDXf?_q}Zp>3QPOVGU9Uo zx5B{d5jJeKa%F{NbO%fpe#%vN_Z>4OR zAkXV&=qmH)jWzH`#bYsi1CbI(0RQ9eecWh&+`a$gkobMm{^R)o0J{EctU%WvsPFpW z`ZIF^b!-2*>ks%-*B|bIM*0iaA4sD8rRxu1{Z2^zc0T+rnfqy00%^xz2Lubl1E%~d za;Z!`Oc;G;y4}t6Rihyo$ubxY80>)%iG!#^CnAKheK%yHmx3T=HN_1vAV*FTOG5go<^=L|3Fhuuk4dtnW`41qsNx)_%*a=C%4Jnazdbb z;Izos3h~<*2VVtD9wY+9}C zPCK`oQRx%ltq{&XCtL_PUzWHJ1zGd1(Q>wFx%X)?%wmP2EqK=%BWTD1H(^HTbEZ4{ z>0&^x*2Sbf%@+XEx!|`%=!~DHl8F&o^7VvG z6tdrXG)AI!Ju}4@(q;|!ZwfCGDrcB+nizoY6UIR|_}Ya762N!ryMk;4s?)vr?kSsj zdPl>1z>#Q>KQ#r480_cd&gN5mA}`gM_+gH zV2Y8o`Bxt$mNUIp3R0nn=L=v=9|f3ky-7p3+1F2py9C!i15dYaNb-tBQQ#kn!%N9Q zHHT5`cmmIYkO32mju)*g5ia)ZNo73_qCm7c1zw~Wu~m;?9;^aAiBM}NId2?pwYcj5 zz6LRAwJ4Ku&~+}~TY5(Uv@Bi}9#gP4&@N9@;FQI)iU!bWMLpi)*BRleu7PNWRWP8L z5}^<97`?1{YK^A#-g~UR#*QOtckKDvYLi`4CZuJjrC{6glg=mE`Sw)|vaiW`=t{BM zP*TFRMRr9za$FU}kws%9d2jum2!jiQ_yF=@FcJqyFYG(B1h(`qgtjn1cLV~O;HStG zzHqW1z$mPx1NrZ(>Ow>Z#q%pJAN0KYn^$2%fAPUr;+`oc5Bp0Q>@95+>6P18E@`l;m z2B8DAYZIZuiH-u=Qn$yO7uD6zG8LXQxhF%Utud#mbuEUW+!N1QA=bwYC2GzVO>o43 zqPmv}@{Q|J`1`zhtMp!dHu>KF&-^YqR%Sg@*W(_qU}kTr*)-bk-KvtPIv_ z$80qv4Vl|mYOMG}CvJK5NBb{^2Yve)#xpP!7A z%(%Aj$?_5U_t*mc_NI&+J5en&$4sF8MgYO;_JCKZ5glGqnNE|fE0iyE4sx_ZhW3YT zy45l)AhGRZqCD){n3qY8>)#Yqd(~ycn+%s&qOFO4d*_{>bzw%84t*5s?^OzbN5!I4 z@dSYp#mlml8mXuhMXSP6WwQPja~EjcPx*eOrrT4;OO?4CQSe1K_RQqdxch@K@0@14 z@fWt4*e-8r_{zhz^wGFJ8P&dzcxj~?*X}cW;R&A+fo{Li8{Rj1^S&hM4egwglJFgb z2c$Qbl#NKPMyPtWL^dPq$LpFxr@_;vRBspSq0G6@&^iaUhEiOkiyv3azDhBUbV6nc 
z`;2#j#hS99C{d;N%lq*=QTUUaZdS(}U{`1E`U}s5@bmesXsC;rKE2w#usKod+=;AQ zKHKZaDm|`oqmd{d&$s3sLc*^mKU+=@fd-EVMz9;Q{33*-H_#r64ON(5t{tB;J-1QS zeJY_Sg^9szRIiXyTBH-T=w$Gw|Lt2r12I5GQQ#Oa>E<)tA?XHPa_J+z*f05=sTI#g zoed{8ZnusdDu#*=8E8Y;`=(XNNRn0|R~scR%NnLHtIiO+d2M-nR=w7-Pf~BiWC{4k zqOVS5lpN^lr~-C6@NnM%UoA29g`|n(>Hv3@)Y95gc#1uZYNybgMKB@~0YEPDobu-3 zQ8)is8jCGc!szm}dzR9gQ_RwKs0e!A;-e5zVI}a*)=t|p%2(VDSWHE)k;N0y3`O@v zFoW%x3>vLx6c0%;i8d;7u@WO|ck5ld@uuo-lh%G^^v5h_d=)4}qjz5>bcw4Xbn zNBq`w2USewvC?)WUbTPRO!8$;OHt9oPoYR4?Un9~&{H1Er^P!l z3pJNA(pIp;RL%>>c0JCxARm$iGQzWp0gwqukkn=)>!pgM7*&ziTs2?6F?J{CHyoF= z+SUh5`ClX3Hu}k!C9QV7qd#%%=nF#Wmh=Qj#`%5v~2kW*+*P_?zm+%9X4aL&}7hcUDZ)y76=Esg&5 z&W>m?)Es*p-MJK=8PTyC4(Oeas!pGxwl!OnKPbx++NCq@I(%Eq{{h1E6W;< zqW+WSVndSV8JO^TahGL>L{F4+y z-;+!(Dk9&Q4yeV8^(DjEtd3W!)HF&y`SCD2+^?F-2tPOGMU&+Aa)9sc!!JZm7)~$4 ziqR$eo|`{kjp{|1+eU438iYUL9qc+rBz?a?@Ij(JYq5uAS*B;ENB^y*QWEn|*MKKfxe~B&1wMg~y4asg{zJ_9-GVG~s1AdlX+>gxJ6qUswKi z&}sP4pt)e{mRnkcPjz)eS_5n9;WtCA7_y5FWf?DJJgxWSiuF2 zG4KJt#B@R08g2=aBecf8&@9)W@_Py1dgnS3S;Rr!<&bo}J+e!olAgd^+)U!JY}OB}-ekmebuVQVrph z;UK8lwmj=A1fb9AE<>?p-9fFF4h!W5g*mW#UV_WwTNrVJZRxTJFgSmf(IkBG9?>rg zuu|~3~)VYxl4wmKY!?|H;C=sDRA+YMJ4Jqkt$IpY%tnBeYKU6 zc1ECYwnXjNlSlf5hJAfFeU;zLc#n`^JHBuA1}UHi>1nQMndT>vzTWQ4N>kNj8Z&mvZzum-=DmFweX7rg%x|tFBsIt^K((&x)Vx z#alIYyIc6$@kG zKrDJqOM~RowCaxCd7oQ`Bq+Il_M_4iYdLoJPNv*npHeK%e|Pd)`egtDFKqqCS##DG zbygGB2ixbnhijLq!;6DgF8J0b>hRSCV1%@Ce#xvVOHu1t0rIeH%12xhMTCI)6+`8~ z%YiMbL@ol@D9Y-&No_?Jsg=h8H0f`0Y2b*`Y;I zQRko3^cxp!O8Ps*r$-8(fV)qMk)@cV9aK=Aqg_rSBc@?`B`9Hi99BPENiYvJWg>p2 z?N*Ox9`y#vp%{$IYo!FUt)Uh`RlDn%<8`pQ9a7!i8ewwWzQsc_wrk1EY$=y-HqB(J zUdJ2E!&G~h;a}#kG@UfPyHwAj#BclxMHb2V`PCuV*MwzSPzh=N^#m%|+WJg;qSNuO z>75!gLhQVeda3-1jwECkvlCGhQ>?7@L)*}0)N-sVC)U~3XR^g zuf}=y#dxp=NN*(<&OO3>WwK0S@kBYB>h)Ybg-UDg%HO}_)#*o~RgZPIEm{}J{1O`| zpJF}LQF|)dMwEz10%W4CDvon}z88bGskGcrR3rIB)8t<+5R zm6o^iE#7+7?u!uw%)NYEYdgU?igXKN>DpU%oO{fUQ0PWEZ2~?X+Wo2R52oGtrP9cu zI$%SuS)0@`WKSFsn|XqjVnvFw=EugL3qu}?v3~I)f&~dH8oUCSk(1X}P+VeMPG2U8 
zym)DsnMFT67JlR$ZR(0AyTLhnrW>2t8yTPs`a&SV3+ybi?tSDm9W!fRzvNf;`?`iI z{0ko!l2=N2lm64Rkp|U~X3IXBvugHY2yLC}h)`4Jb@qk&D&tE9ajh{!;z)S5K09QT z9h{ekk>ql?&@zi-_`E6$g4a0TqPA#=-mkRNQ}`(1oRQ}dR8ZfIJl({@TqQiHI`ZLx z9+mndh4T3R+aD>Ee}@2-{%ZpCyH?@1NGp&l{HL0l-wDtMGVus=rPfzcASUqTorAangMxv>9_f~+WnT?NF0Ff0+U zU%1ICQ)Cdp5Uvpbj5r`wbE9v(aJqE~A7iYkk1^JFSdTH**@|he_Np&^KBJ4W|IRVo(cZ3iVeGXevie+@>k{iAyJG2tgR{$$MmhY|liMtJ_hh*rige_}`cz5#zNtbfOjko!xF zVEdt5{)G{2KSHbj7mRp(_V3pKFo5}Qgh~E?rr3HM^`GPPYbyT6-R z?1;a#BY@WRf9PxcK7;=~81db0{S71j=!^ajBmNVN0Q-dzD~x-8f)T%Oz+Z>Yzk?A< ze~l5}UB=(|2;gf>|BR2|1j;e~=4<>EQvo$ja1T1C2NBhS@aa(i^;4MjAclGnP5pNL zkoU*CK>#+OKK0?52SL`OP6}Ao!}srkte?Msim`rP|6b2yTYws?hdhtve6RPrAPZO@ z2M6ch^jAN{Uk~|ymibWD!}UWsk7WSc^(f%_S=jGyc0CT#PrUvChO8VM|Js+77W zKz^3QomM%;KTEJR)}Imv>(ut%O8#DFf$`8y=w-TDIe2sEUcorxIP&&gd1{{|G>#95 z5c8i2y@8AOKRkJU$A$l%ljl1b^4rPtyMy)9gJtCazG3k5T>{{Hng52EvP|7WTdBVx zv+0rylpRgT|A`O`2}OAzilBl+CuUF%OekC<2&gs)Dx`?3yvRg2H}Dcq0V^?-Fj3P+ z)Kbf;>}}nrLY14T>2KC$D@cbj9n)hso{K*8SHn7{qi;B!k`6hY;?bsQgt5$@8KP>U z5mv#a_zbf&Vv6^cX)T_Qj*3d);kMEkTFh|o_%@5PI6Jo*xixZ&_LJv3w?lW|H8dtl zzw1vST=w$v6p>>0aVA_OP!;K+-_1$R%stc@yN^*rXz9z*Ct)}>j>U-*s6FhC;e)rU z<%L5xp{d8anKtwd^2l2?J4d-5-}KS?)|M25_ACO#z3$eGZWul4SsT>Oso`9YzGihzTfcNFZ`NO`?dsM!r0K%0?hBSC;0fy zQOCiyxyRWwW^)t$vt)Hq0;M2l#vRQ~jEFs^S2+3g}Tr%Jg z!g_5SS;Ibuac0iiaDch$Ucmk&^O`>BC0|=GTvj{C!CVJ*=7nII%pt&uE;0&QliYnQ z*o@j7ZkG9AT8N%RI zeiG4B<|w7leVD!?i#C0Ed4_b^om)wd)DvM3%NKmebcaiT-i`BGRkaVQ=EVi3SudnL z=V#pVsF%a7tL}aQ9l|Jd98_PQ55Rov3hS~rXg#VLkYbTdAo)bg6#hwMOys`GzUs9} zoS}k^!^q^2qE*v4(}xq5(qZ|2H2DT2LC3sU3X)O#PJnv|_{$d_K4^C#c!@;Ec3xP~fI)rr`0hsMpph%%{CB+L!i6VBWf4ejFG>U1T z=58o2D+m&P=N!80)eT}^Qm?Ub?|%?#Dl#O=0|{B#6K&z{ zrh6LNiS1yrki=lWkZ$#+gYpuYg$b2_h|3RPph3ldbhkoWlu~W)nJ@X$8eR%=jd${F zRLlm2Py(C``dJAtkF(jQ+pT7g7nL?PCF~`)*P=8aQiPt-6U@gLEBTs1!y&klW2|hG ztWoxZ&`%kJJaFba`f*q)PO)W)8zHVM18@zkENPlpts(aERV90;G|{T0wkgm?BU)W3 zn&>QYzNuch?MkH;ANbsbf5BJPnBiQHF2u*Y#}|Pdt5f0AJ)wS2MTR`> z4Kr%zuOi?bq)3brUJGaoW@P>PIkkEBIFy9WMLYFwuxJJwd|y3~19e=;nEO2j6@h#z 
zfyTD^qDwNfUuPh(>{PaqbvpHBTluH>)H*_T$t5qR-7$-PO{%W{r@bqWhjRP>vSukn z_N{E8nAOZ6OR^=|MJT&$V^{VqWV^0COXSLyvLyQ!$`X+!O9>HSq_Gtdzh~U~S>|@T z_x|^Ny?S2$nUBvo&pGFF=Kaib&ilMSq|8*3rilix946Zq2ieyVO7?mBn@c5 zVYA0`mjw7j;0DHBV(#S%jU2995yWC}A^KGCo{gvbP z=^8xK)YbGvN0tfZGH+!(WeJtfjw?FS5x0oy@MphRC`xncve%8%z2J$10+vV%eoo4V zn&AS(3VA+~3GBoP0o27t(mW-!-3}?&WXkx3^BdG)^{rV|w<_DyZ^vpnk?Zr=t#vU< zjGP#Gw07?4NEK>gvg0Gaq7&2FQi=_)tCaf8hS4GB>6XJqWX80s2v>rPR-y|1KzTp$ zn47Lxz82A~S=H9$+4aM>IauF8MPe4LezP7M@VUKob7R22(K48<27O^+?ncwmJd91n zW63YkKHM#}RrNfn8!I2O6Irq6Q-kwGE|$7T=<3or71&<0J5{Pld{s#pDero=mO18z z+TG=%vVOUuNV#{D=c)7gL@`lk=4&&q@)z@EJAbBHC8;Y*H*w^e%ix@AU~p?jxV~_?$VYp)H$3t3ljisK$9zX>l{Zg@c7#WEo-*Lj9)o-98(G93 z6PjQWH^QR7W_$_>N%+Vjv3yR(nZI$hr@#F=+U3Hhao#l)`)!}sQ*3W;X=}=@oU>Cf zXF0p_%DmPwqTf*ED8)76aXsaZ;+)3$J4@&a1M}Z=7j(_3R;pt#w8*iq97daAK zcLlQ-RT?)Ssx4pa0!d$61s-vVqLY118RN4qI1gJVImMp0_EkYcaYQA#+U0az z1bYDL0iIkDEY4B?WF$6U1O%amazr=9+arZ+7bYxX|KbIaZDL6IybF<57DA`>Iq_Z54;@lCALr9{TK zFShFx;M)>+OKEOu()I}|lB1N{1aJ z_FtYgWXt|>In&fzN)JIXe%1ZGNPtE_$sKcZq-1#1zTC7HY-BEWKOH zw!r2?UTB|~>kBK8{Y)ND{_Y4XsoM_F@ARJueF#teoVm#IU= zi3$iSs6knpd6ngBQPJ$u_EQ1+J8tj7|ux=4dap44f9JROXmC@Hw-I;iP!c z3-~D1wSkwd--l5Iewk0<6>D45DQ<@OmNeIjmcvdV1mLxEr?WYyk5kbsUc_oi+vcHZ zh=hkq1qkTFM(H0vpDbWzwVU)4h9%@y+$$=nW%P}%d&u`f%9bjW@@D8O>{Rcw`Xyp{ zA=#BYr>^c*G`4xgJ)3BtmD3Z;z7Te5iJ$L6+L5eBkq$4lh4tOttvVt*BvZnnj2+!q zO$?i1e1(-sx0cz=2=$QM(>3{3Ari##$#H{@DXj!D_JMD|a!P9+rky$n6CGn<2|Xk$ z^LuxIK($8L4N7~xbLU6B!|?BkXVJ-)OR?m8>0(AN%j3a9G>U-`klbRAXu4d`L%f zFxrESiO5AHvAZNlgCwbf`VJ|{Auf{^VY{9IQB{@H>`1$)#$*8!X3X4$6~hHrW0}bM z!JhtN_d9m);^U3c3*hdfg4J{GbuG@8mYroYz18)x&QV93YGidj8`io{-MGozr0D~l z^oVMvHXsc%+>k#{c?Hzsc`6=4efO@qNr6^ISfAo!=CLQ-!OzY=e>JF~)R822Glbnn z5y`Gs?Lp}7oPkPkOurc8HQt*1j*gIf%MFwIV1Ta8X7TF*0}scMsj~jU!ZQEO4e|71 z5g9VI!~<)WOM4UX z;R#|3h)p}$lV)!>%zDB}!EzA?pMy@C8wwPsiZ&6kjeH%bVtog<<%dGdsU9k!j+K<(4b9lT+?zqUN7~JgHhVj?lyJCKDu=DQ@!c7izSH=~n$1jq$nOI< z+|!tGk$X*lPR%BwS$v64jx$S=FX)5qjloA*t(kq+)8zEF1lMIU)#cj9?#LdVR&Ii)|r`5hlnd6i>B`4#-gDS`}Qs0a5a4 
z3RZ68w-mh}x-tr3u>EAQ-q^Ih_Vm2FkF}Abm~WbODFIW%paPuALGbF;t7AU3?!7z{ zTG?*87Ck**%S#u)$+ezhm zb#uY<1_Vx&+IMWS^w3Gn5w>~DhK(tu*agvNZ+@GNy*6ZDaal*Rkczc@wl{bzs{~Su zeAzh^&K($8YIz;*%!Mu<=xo{$&KBl<>T6H#TdpsqYRvBbS?rCYrNx_4MW3h1WN#-o z$2ktkn&j|uOlVbG=iFa(R(?;HRC9De^0hq{#;Hm=sz#JPf>j~ubbDkFlIcHGDOKRX z&ndjI!I6-&S+#b|eMl=Y{o=*+lq)TY#Z|~8iSgGy&l>|BKC(VQ)UDO5h`scd41G)g zc;J;6slTZ*<4M_Mf{%gJv}*EObm40uwPgy7UJtG;f#5C}hqmr_aT6^uM^s~(8iG^~ z9FV(REfi|CbW&vG2?x5m$D?}X3;XP`%AyW!U75wg?CKh-&Y76uFh}qN<4G2T7XT(Gm zmI$6#QSKJ|SW#v9Ti5h4UjG@!GrTz-yv=uKCKqXmoP_5o!hi!DOTwQxm{N*tAC^BX zEP1`$F{C+AIaqx`IdXPgX^>xRQd-=Knm_q>#it=aI4)YLD{@_G5~cNPQ39oq$O@yhAaE0obMh;IaI7@Zj^Do?-T z6f!&-$)jc&IMF1Z`|R{IR9y1Ir$3&d#?ua+<`p6&3~PbS+23N2Qb=jFeUB-RNM@m_ zDD_*ofhF+Kq2a72k^)aNn;tsRB%6~d(|9jKxbv7NhxTDkWMAdxVS$De&PSfn?;|Mr zd{x;|>ZZqQR|~NbG!JQ3Pbd<2X=@&%(YTi#bIX~mUr|6dkCgH)HrJ)2h#BfK2V3>v zomLvx=co{SRlPYZJ?@n%pG{I3n`>Ph-s`6;B;FY`TEYiCl@tp1Ru>2J&`1`l>pjTm zPXAqJTymLldN!7Zj`ls`voAr&{mL`=EBD({ao<;q|J};H_77L?5#OX1c6Rp=-#7vP zWp{78?Dr=A+THu%?%wvOe;=ovQ2q@AV0Q%nXVzE&K?U3<(9Rm`w&8ofzpt_WVhR0o zme861-`&$C0b~CHBX(0Pb|%(8+1KDT@3jeqay;1iWnhv_4>{aPj!L9|Q^qz8F9ELm+T^ z;y?F8B2fRx4~4`XjoMou6oFF@-)lp{5O{vu9((;DAQbM@+}^T(Z8&k|{Wch&oW9=%N8n@$6Y(~9tFW=-`i_LA_0cnJ{t;# z*gvLVkSJc8!5}D(`L?G%V5IP31O_8;0@Qo`z%bbUaRCFm>HGT&1`^QY*&sMp-oE;v zaQyZm!KnRX3PvDNcx^_)a2o4-+lPYS%|9>_h4;M#qX6vNR~7<-;>8ls$pTUa?7asB zP|U_F3x?ue%J#Gc0tVs52m%(x8v_Uwgf};Ua~rtDoW1oyftkPGhJ=Ig+6T-@y!Igx zxFf22>qEiu<^co+1PRX%i5E*K2&W>xw?3d7c>REaka*t(C>Vmv>F~WRE>3{nzqu0y zF7KtLm4`X-OaT%O*xS2sZ6`a>0JO=`7WP1*gg+UrN@oFCdI^-dg&EWYghYZ(QD9>j v0s(?qK%sCGQv?zYH5WCJr1-hZpNDXrU4RGX_R|ashN8d}yu5O%@)Z9E4yjLZ From 5d739b9639c698a380974055dcd25ba8f499be01 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Mon, 25 Nov 2024 17:35:03 +0530 Subject: [PATCH 18/38] Delete examples/notebooks/PII/invoicedata/test.py Signed-off-by: Pooja Holkar --- examples/notebooks/PII/invoicedata/test.py | 1 - 1 file changed, 1 deletion(-) delete mode 100644 
examples/notebooks/PII/invoicedata/test.py diff --git a/examples/notebooks/PII/invoicedata/test.py b/examples/notebooks/PII/invoicedata/test.py deleted file mode 100644 index 8b1378917..000000000 --- a/examples/notebooks/PII/invoicedata/test.py +++ /dev/null @@ -1 +0,0 @@ - From 006038ac411942cf9fd81caac936adb549121026 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Mon, 25 Nov 2024 19:45:26 +0530 Subject: [PATCH 19/38] notebook recipe for PII redaction code The notebook is a Kickstarter for using PII redaction transform Signed-off-by: Pooja Holkar --- ...un_your_first_PII_redactor_transform.ipynb | 196 ++++++++++++++++++ 1 file changed, 196 insertions(+) diff --git a/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb index 13cdae5f8..c671ca139 100644 --- a/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb +++ b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb @@ -4,19 +4,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ +<<<<<<< HEAD "Extracting Text from PDF and Configuring PII Redactor" +======= + "## Extracting Text from PDF and Configuring PII Redactor" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { "cell_type": "markdown", "metadata": {}, "source": [ +<<<<<<< HEAD "What is a PII Redactor?\n", "A PII (Personally Identifiable Information) Redactor is a tool or system designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n", +======= + "\n", + "**Author**: Pooja Holkar,\n", + "**email**: poholkar@in.ibm.com\n", + "\n", + "\n", + "\n", + "\n", + "### What is a PII Redactor?\n", + "\n", + "A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. 
PII includes details that can be used to identify an individual, such as:\n", +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "\n", "Names\n", "Email addresses\n", "Phone numbers\n", +<<<<<<< HEAD "Physical or shipping addresses\n", "Financial details (e.g., credit card numbers)\n", "Use Case in This Project\n", @@ -27,33 +45,89 @@ "\n", "The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n", "Redactor Configuration:\n", +======= + "Addresses\n", + "Financial details (e.g., credit card numbers)\n", + "\n", + "### Overview of the use case\n", + "In this use case, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.\n", + "\n", + " **Workflow Overview**\n", + "\n", + "The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n", + "\n", + " **Redactor Configuration**\n", +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "\n", "The system is configured to recognize specific PII entities relevant to invoices, such as:\n", "Customer names\n", "Email addresses\n", "Phone numbers\n", "Shipping addresses\n", +<<<<<<< HEAD "PII Detection and Redaction:\n", +======= + "\n", + " **PII Detection and Redaction**\n", +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "\n", "The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.\n", "Output:\n", "\n", "The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.\n", +<<<<<<< HEAD "Why is PII Redaction Important?\n", "Data Privacy Compliance: Adheres to regulations like GDPR, HIPAA, or CCPA that mandate safeguarding customer information.\n", "Risk Mitigation: Prevents unauthorized access to or misuse of sensitive data.\n", "Automation Benefits: Simplifies and accelerates the process of securing information in 
large-scale document handling.\n" +======= + "\n", + "### Why is PII Redaction Important?\n", + "\n", + " **Data Privacy Compliance**: Adheres to regulations like GDPR, HIPAA, or CCPA that mandate safeguarding customer information.\n", + "\n", + " **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.\n", + "\n", + " **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pre-req: Install data-prep-kit dependencies" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { "cell_type": "code", +<<<<<<< HEAD "execution_count": 8, +======= + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# !pip install transforms\n", + "# !pip install pdfplumber\n", + "# !pip install flair\n", + "# !pip install spacy\n", + "# !pip install presidio_anonymizer==2.2.355" + ] + }, + { + "cell_type": "code", + "execution_count": 2, +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [], "source": [ "import pdfplumber\n", +<<<<<<< HEAD "#from data_processing.transform.table_transform import AbstractTableTransform\n", "#from data_processing.transform import AbstractTableTransform, TransformConfiguration\n", +======= +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "from pii_redactor_transform import PIIRedactorTransform\n" ] }, @@ -61,11 +135,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ +<<<<<<< HEAD "Step 1: Extract Text from PDF" +======= + "### Step 1: Inspect the Data \n", + "\n", + "We will use simple invoice PDF\n", + "\n", + "[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { "cell_type": "code", +<<<<<<< HEAD "execution_count": 9, "metadata": {}, "outputs": [], @@ 
-114,20 +197,52 @@ "source": [ "!pip uninstall numpy --yes\n", "#!pip install numpy==1.19.3\n" +======= + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "UsageError: Line magic function `%!wget` not found.\n" + ] + } + ], + "source": [ + "!wget 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "pdf_path=\"Invoice.pdf\"" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { "cell_type": "markdown", "metadata": {}, "source": [ +<<<<<<< HEAD "Step 1: Extract Text from PDF\n", +======= + "### Step 2: Extract Text from PDF\n", +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "\n", "This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string." 
] }, { "cell_type": "code", +<<<<<<< HEAD "execution_count": 11, +======= + "execution_count": 13, +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [], "source": [ @@ -140,7 +255,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ +<<<<<<< HEAD "#Step 2: Configure the PII Redactor\n" +======= + "### Step 3: Configure the PII Redactor\n" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { @@ -153,7 +272,11 @@ }, { "cell_type": "code", +<<<<<<< HEAD "execution_count": 12, +======= + "execution_count": 14, +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [], "source": [ @@ -170,7 +293,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ +<<<<<<< HEAD "Step 3: Initialize and Run the PII Redactor\n" +======= + "### Step 4: Initialize and Run the PII Redactor\n" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { @@ -182,6 +309,7 @@ }, { "cell_type": "code", +<<<<<<< HEAD "execution_count": 13, "metadata": {}, "outputs": [ @@ -190,13 +318,44 @@ "output_type": "stream", "text": [ "20:33:16 INFO - Loading model from flair/ner-english-large\n" +======= + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting en-core-web-sm==3.8.0\n", + " Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)\n", + "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m9.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m[31m9.9 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hInstalling collected packages: en-core-web-sm\n", + "Successfully installed en-core-web-sm-3.8.0\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "17:45:46 INFO - Loading model from flair/ner-english-large\n" +>>>>>>> 25a65d81 (notebook 
recipe for PII redaction code) ] }, { "name": "stdout", "output_type": "stream", "text": [ +<<<<<<< HEAD "2024-11-24 20:33:33,105 SequenceTagger predicts: Dictionary with 20 tags: , O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, , \n" +======= + "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", + "You can now load the package via spacy.load('en_core_web_sm')\n", + "\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n", + "If you are in a Jupyter or Colab notebook, you may need to restart Python in\n", + "order to load all the package's dependencies. You can do this by selecting the\n", + "'Restart kernel' or 'Restart runtime' option.\n", + "2024-11-25 17:46:04,004 SequenceTagger predicts: Dictionary with 20 tags: , O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, , \n" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] } ], @@ -209,7 +368,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ +<<<<<<< HEAD "Step 4: Apply the Redactor to Text Data\n" +======= + "### Step 5: Apply the Redactor to Text Data\n" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { @@ -221,7 +384,11 @@ }, { "cell_type": "code", +<<<<<<< HEAD "execution_count": 14, +======= + "execution_count": 16, +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [], "source": [ @@ -234,7 +401,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ +<<<<<<< HEAD "Step 5: Display the Redaction Results\n" +======= + "### Step 6: Display the Redaction Results\n" +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { @@ -246,7 +417,11 @@ }, { "cell_type": "code", +<<<<<<< HEAD "execution_count": 15, +======= + "execution_count": 17, +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [ { @@ -290,11 +465,28 @@ "print(\"Redacted Text:\\n\", 
redacted_text)\n", "print(\"Detected Entities:\\n\", detected_entities)" ] +<<<<<<< HEAD +======= + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
\n", + "\n", + "### This notebook effectively demonstrates how to seamlessly apply redaction for PII entities." + ] +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) } ], "metadata": { "kernelspec": { +<<<<<<< HEAD "display_name": "data-prep-kit-1", +======= + "display_name": "Python 3 (ipykernel)", +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) "language": "python", "name": "python3" }, @@ -312,5 +504,9 @@ } }, "nbformat": 4, +<<<<<<< HEAD "nbformat_minor": 2 +======= + "nbformat_minor": 4 +>>>>>>> 25a65d81 (notebook recipe for PII redaction code) } From 2dd4c6d62be563d2259640876f07341883d24c61 Mon Sep 17 00:00:00 2001 From: Michele Dolfi Date: Wed, 13 Nov 2024 16:41:30 +0100 Subject: [PATCH 20/38] update pdf2parquet README Signed-off-by: Michele Dolfi Signed-off-by: Pooja Holkar --- transforms/language/pdf2parquet/README.md | 8 +- .../language/pdf2parquet/python/README.md | 117 ++++++++++++++---- transforms/language/pdf2parquet/ray/README.md | 50 +++++++- 3 files changed, 148 insertions(+), 27 deletions(-) diff --git a/transforms/language/pdf2parquet/README.md b/transforms/language/pdf2parquet/README.md index 14373a68c..89a53147d 100644 --- a/transforms/language/pdf2parquet/README.md +++ b/transforms/language/pdf2parquet/README.md @@ -1,10 +1,10 @@ -# PDF2PARQUET Transform +# Pdf2Parquet Transform -The PDF2PARQUET transforms iterate through PDF files or zip of PDF files and generates parquet files -containing the converted document in Markdown format. +The Pdf2Parquet transforms iterate through PDF, Docx, Pptx, Images files or zip of files and generates parquet files +containing the converted document in Markdown or JSON format. -The PDF conversion is using the [Docling package](https://github.com/DS4SD/docling). +The conversion is using the [Docling package](https://github.com/DS4SD/docling). 
The following runtimes are available: diff --git a/transforms/language/pdf2parquet/python/README.md b/transforms/language/pdf2parquet/python/README.md index a4bd31e06..aaf56669f 100644 --- a/transforms/language/pdf2parquet/python/README.md +++ b/transforms/language/pdf2parquet/python/README.md @@ -1,4 +1,15 @@ -# Ingest PDF to Parquet +# Ingest PDF to Parquet Transform + +Please see the set of +[transform project conventions](../../../README.md#transform-project-conventions) +for details on general project conventions, transform configuration, +testing and IDE set up. + +## Contributors + +- Michele Dolfi (dol@zurich.ibm.com) + +## Description This tranforms iterate through document files or zip of files and generates parquet files containing the converted document in Markdown or JSON format. @@ -7,6 +18,9 @@ The PDF conversion is using the [Docling package](https://github.com/DS4SD/docli The Docling configuration in DPK is tuned for best results when running large batch ingestions. For more details on the multiple configuration options, please refer to the official [Docling documentation](https://ds4sd.github.io/docling/). + +### Input files + This transform supports the following input formats: - PDF documents @@ -17,32 +31,33 @@ This transform supports the following input formats: - Markdown documents - ASCII Docs documents +The input documents can be provided in a folder structure, or as a zip archive. +Please see the configuration section for specifying the input files. 
-## Output format -The output format will contain all the columns of the metadata CSV file, -with the addition of the following columns +### Output format -```jsonc -{ - "source_filename": "string", // the basename of the source archive or file - "filename": "string", // the basename of the PDF file - "contents": "string", // the content of the PDF - "document_id": "string", // the document id, a random uuid4 - "document_hash": "string", // the document hash of the input content - "ext": "string", // the detected file extension - "hash": "string", // the hash of the `contents` column - "size": "string", // the size of `contents` - "date_acquired": "date", // the date when the transform was executing - "num_pages": "number", // number of pages in the PDF - "num_tables": "number", // number of tables in the PDF - "num_doc_elements": "number", // number of document elements in the PDF - "pdf_convert_time": "float", // time taken to convert the document in seconds -} -``` +The output table will contain following columns +| output column name | data type | description | +|-|-|-| +| source_filename | string | the basename of the source archive or file | +| filename | string | the basename of the PDF file | +| contents | string | the content of the PDF | +| document_id | string | the document id, a random uuid4 | +| document_hash | string | the document hash of the input content | +| ext | string | the detected file extension | +| hash | string | the hash of the `contents` column | +| size | string | the size of `contents` | +| date_acquired | date | the date when the transform was executing | +| num_pages | number | number of pages in the PDF | +| num_tables | number | number of tables in the PDF | +| num_doc_elements | number | number of document elements in the PDF | +| pdf_convert_time | float | time taken to convert the document in seconds | -## Parameters + + +## Configuration The transform can be initialized with the following parameters. 
@@ -58,9 +73,67 @@ The transform can be initialized with the following parameters. | `pdf_backend` | `dlparse_v2` | The PDF backend to use. Valid values are `dlparse_v2`, `dlparse_v1`, `pypdfium2`. | | `double_precision` | `8` | If set, all floating points (e.g. bounding boxes) are rounded to this precision. For tests it is advised to use 0. | + +Example + +```py +{ + "contents_type": "application/json", + "do_ocr": True, +} +``` + +## Usage + +### Launched Command Line Options + When invoking the CLI, the parameters must be set as `--pdf2parquet_`, e.g. `--pdf2parquet_do_ocr=true`. +### Running the samples +To run the samples, use the following `make` targets + +* `run-cli-sample` - runs src/pdf2parquet_transform_python.py using command line args +* `run-local-sample` - runs src/pdf2parquet_local.py +* `run-local-python-sample` - runs src/pdf2parquet_local_python.py + +These targets will activate the virtual environment and set up any configuration needed. +Use the `-n` option of `make` to see the detail of what is done to run the sample. + +For example, +```shell +make run-local-python-sample +... +``` +Then +```shell +ls output +``` +To see results of the transform. + + +### Code example + +TBD (link to the notebook will be provided) + +See the sample script [src/pdf2parquet_local_python.py](src/pdf2parquet_local_python.py). + + +### Transforming data using the transform image + +To use the transform image to transform your data, please refer to the +[running images quickstart](../../../../doc/quick-start/run-transform-image.md), +substituting the name of this transform image and runtime as appropriate. 
+ +## Testing + +Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) + +Currently we have: +- [Unit test](transforms/language/pdf2parquet/python/test/test_pdf2parquet_python.py) +- [Integration test](transforms/language/pdf2parquet/python/test/test_pdf2parquet.py) + + ## Credits The PDF document conversion is developed by the AI for Knowledge group in IBM Research Zurich. diff --git a/transforms/language/pdf2parquet/ray/README.md b/transforms/language/pdf2parquet/ray/README.md index 5ef98f645..4db4b47c7 100644 --- a/transforms/language/pdf2parquet/ray/README.md +++ b/transforms/language/pdf2parquet/ray/README.md @@ -1,7 +1,55 @@ -# PDF2PARQUET Ray Transform +# Ingest PDF to Parquet Ray Transform +Please see the set of +[transform project conventions](../../../README.md#transform-project-conventions) +for details on general project conventions, transform configuration, +testing and IDE set up. This module implements the ray version of the [pdf2parquet transform](../python/). +## Summary +This project wraps the [Ingest PDF to Parquet transform](../python) with a Ray runtime. + +## Configuration and command line Options + +Ingest PDF to Parquet configuration and command line options are the same as for the base python transform. + +## Running + +### Launched Command Line Options +When running the transform with the Ray launcher (i.e. TransformLauncher), +In addition to those available to the transform as defined in [here](../python/README.md), +the set of +[ray launcher](../../../../data-processing-lib/doc/ray-launcher-options.md) are available. 
+ +### Running the samples +To run the samples, use the following `make` targets + +* `run-cli-sample` - runs src/pdf2parquet_transform_ray.py using command line args +* `run-local-sample` - runs src/pdf2parquet_local_ray.py +* `run-s3-sample` - runs src/pdf2parquet_s3_ray.py + * Requires prior invocation of `make minio-start` to load data into local minio for S3 access. + +These targets will activate the virtual environment and set up any configuration needed. +Use the `-n` option of `make` to see the detail of what is done to run the sample. + +For example, +```shell +make run-cli-sample +... +``` +Then +```shell +ls output +``` +To see results of the transform. + + +### Transforming data using the transform image + +To use the transform image to transform your data, please refer to the +[running images quickstart](../../../../doc/quick-start/run-transform-image.md), +substituting the name of this transform image and runtime as appropriate. + ## Prometheus metrics From 5be31767b7b17d55a130c8e60d376d51ae7f156d Mon Sep 17 00:00:00 2001 From: Michele Dolfi Date: Wed, 13 Nov 2024 16:53:39 +0100 Subject: [PATCH 21/38] add data_files_to_use Signed-off-by: Michele Dolfi Signed-off-by: Pooja Holkar --- transforms/language/pdf2parquet/python/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/transforms/language/pdf2parquet/python/README.md b/transforms/language/pdf2parquet/python/README.md index aaf56669f..d9dc2520a 100644 --- a/transforms/language/pdf2parquet/python/README.md +++ b/transforms/language/pdf2parquet/python/README.md @@ -63,6 +63,7 @@ The transform can be initialized with the following parameters. | Parameter | Default | Description | |------------|----------|--------------| +| `data_files_to_use` | - | The files extensions to be considered when running the transform. Example value `['.pdf','.docx','.pptx','.zip']`. For all the supported input formats, see the section above. 
| | `batch_size` | -1 | Number of documents to be saved in the same result table. A value of -1 will generate one result file for each input file. | | `artifacts_path` | | Path where to Docling models artifacts are located, if unset they will be downloaded and fetched from the [HF_HUB_CACHE](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache) folder. | | `contents_type` | `text/markdown` | The output type for the `contents` column. Valid types are `text/markdown`, `text/plain` and `application/json`. | @@ -78,6 +79,7 @@ Example ```py { + "data_files_to_use": ast.literal_eval("['.pdf','.docx','.pptx','.zip']"), "contents_type": "application/json", "do_ocr": True, } From 13a90ce6a07af343cd4165ac219e5b3b983dacdf Mon Sep 17 00:00:00 2001 From: Michele Dolfi Date: Wed, 13 Nov 2024 18:03:19 +0100 Subject: [PATCH 22/38] doc_chunk README Signed-off-by: Michele Dolfi Signed-off-by: Pooja Holkar --- .../language/doc_chunk/python/README.md | 57 +++++++++++++++++-- 1 file changed, 52 insertions(+), 5 deletions(-) diff --git a/transforms/language/doc_chunk/python/README.md b/transforms/language/doc_chunk/python/README.md index 9abca2b79..1ec3a8080 100644 --- a/transforms/language/doc_chunk/python/README.md +++ b/transforms/language/doc_chunk/python/README.md @@ -1,5 +1,16 @@ # Chunk documents Transform +Please see the set of +[transform project conventions](../../../README.md#transform-project-conventions) +for details on general project conventions, transform configuration, +testing and IDE set up. + +## Contributors + +- Michele Dolfi (dol@zurich.ibm.com) + +## Description + This transform is chunking documents. It supports multiple _chunker modules_ (see the `chunking_type` parameter). When using documents converted to JSON, the transform leverages the [Docling Core](https://github.com/DS4SD/docling-core) `HierarchicalChunker` @@ -9,20 +20,26 @@ which provides the required JSON structure. 
When using documents converted to Markdown, the transform leverages the [Llama Index](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#markdownnodeparser) `MarkdownNodeParser`, which is relying on its internal Markdown splitting logic. -## Output format + +### Input + +| input column name | data type | description | +|-|-|-| +| the one specified in _content_column_name_ configuration | string | the content used in this transform | + + +### Output format The output parquet file will contain all the original columns, but the content will be replaced with the individual chunks. -### Tracing the origin of the chunks +#### Tracing the origin of the chunks The transform allows to trace the origin of the chunk with the `source_doc_id` which is set to the value of the `document_id` column (if present) in the input table. The actual name of columns can be customized with the parameters described below. -## Running - -### Parameters +## Configuration The transform can be tuned with the following parameters. @@ -40,6 +57,12 @@ The transform can be tuned with the following parameters. | `output_pageno_column_name` | `page_number` | Column name to store the page number of the chunk in the output table. | | `output_bbox_column_name` | `bbox` | Column name to store the bbox of the chunk in the output table. | + + +## Usage + +### Launched Command Line Options + When invoking the CLI, the parameters must be set as `--doc_chunk_`, e.g. `--doc_chunk_column_name_key=myoutput`. @@ -63,8 +86,32 @@ ls output ``` To see results of the transform. +### Code example + +TBD (link to the notebook will be provided) + +See the sample script [src/doc_chunk_local_python.py](src/doc_chunk_local_python.py). 
+ + ### Transforming data using the transform image To use the transform image to transform your data, please refer to the [running images quickstart](../../../../doc/quick-start/run-transform-image.md), substituting the name of this transform image and runtime as appropriate. + +## Testing + +Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) + +Currently we have: +- [Unit test](test/test_doc_chunk_python.py) + + +## Further Resource + +- For the [Docling Core](https://github.com/DS4SD/docling-core) `HierarchicalChunker` + - +- For the Markdown chunker in LlamaIndex + - [Markdown chunking](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#markdownnodeparser) +- For the Token Text Splitter in LlamaIndex + - [Token Text Splitter](https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/) From a78729c4e5ef393c292c003f85557bc43cea4f1a Mon Sep 17 00:00:00 2001 From: Michele Dolfi Date: Wed, 13 Nov 2024 18:03:30 +0100 Subject: [PATCH 23/38] text_encoder README Signed-off-by: Michele Dolfi Signed-off-by: Pooja Holkar --- .../language/text_encoder/python/README.md | 46 +++++++++++++++++-- 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/transforms/language/text_encoder/python/README.md b/transforms/language/text_encoder/python/README.md index 4c927d1ed..fa9c54ada 100644 --- a/transforms/language/text_encoder/python/README.md +++ b/transforms/language/text_encoder/python/README.md @@ -1,14 +1,36 @@ # Text Encoder Transform -## Summary +Please see the set of +[transform project conventions](../../../README.md#transform-project-conventions) +for details on general project conventions, transform configuration, +testing and IDE set up. 
+ +## Contributors + +- Michele Dolfi (dol@zurich.ibm.com) + +## Description + This transform is using [sentence encoder models](https://en.wikipedia.org/wiki/Sentence_embedding) to create embedding vectors of the text in each row of the input .parquet table. The embeddings vectors generated by the transform are useful for tasks like sentence similarity, features extraction, etc which are also at the core of retrieval-augmented generation (RAG) applications. +### Input + +| input column name | data type | description | +|-|-|-| +| the one specified in _content_column_name_ configuration | string | the content used in this transform | + + +### Output columns + + +| output column name | data type | description | +|-|-|-| +| the one specified in _output_embeddings_column_name_ configuration | `array[float]` | the embeddings vectors of the content | -## Running -### Parameters +## Configuration The transform can be tuned with the following parameters. @@ -18,7 +40,11 @@ The transform can be tuned with the following parameters. | `model_name` | `BAAI/bge-small-en-v1.5` | The HF model to use for encoding the text. | | `content_column_name` | `contents` | Name of the column containing the text to be encoded. | | `output_embeddings_column_name` | `embeddings` | Column name to store the embeddings in the output table. | -| `output_path_column_name` | `doc_path` | Column name to store the document path of the chunk in the output table. | + + +## Usage + +### Launched Command Line Options When invoking the CLI, the parameters must be set as `--text_encoder_`, e.g. `--text_encoder_column_name_key=myoutput`. @@ -43,8 +69,20 @@ ls output ``` To see results of the transform. 
+### Code example + +TBD (link to the notebook will be provided) + + ### Transforming data using the transform image To use the transform image to transform your data, please refer to the [running images quickstart](../../../../doc/quick-start/run-transform-image.md), substituting the name of this transform image and runtime as appropriate. + +## Testing + +Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) + +Currently we have: +- [Unit test](test/test_text_encoder_python.py) \ No newline at end of file From e19d0d6a17a66308f974c5bfb9ab9ba6ebeb0b45 Mon Sep 17 00:00:00 2001 From: Maroun Touma Date: Wed, 20 Nov 2024 13:27:21 -0500 Subject: [PATCH 24/38] Added notebook for pdf2parquet Signed-off-by: Maroun Touma Signed-off-by: Pooja Holkar --- .../language/pdf2parquet/pdf2parquet.ipynb | 212 ++++++++++++++++++ 1 file changed, 212 insertions(+) create mode 100644 transforms/language/pdf2parquet/pdf2parquet.ipynb diff --git a/transforms/language/pdf2parquet/pdf2parquet.ipynb b/transforms/language/pdf2parquet/pdf2parquet.ipynb new file mode 100644 index 000000000..1ba814170 --- /dev/null +++ b/transforms/language/pdf2parquet/pdf2parquet.ipynb @@ -0,0 +1,212 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "afd55886-5f5b-4794-838e-ef8179fb0394", + "metadata": {}, + "source": [ + "##### **** These pip install need to be adapted to use the appropriate release level. 
Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", + "\n", + "##### **** example for transform developers working from git clone\n", + "```\n", + "make venv\n", + "source venv/bin/activate && pip install jupyterlab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "## This is here as a reference only\n", + "# Users and application developers must use the right tag for the latest from pypi\n", + "#!pip install data-prep-toolkit\n", + "#!pip install data-prep-toolkit-transforms\n", + "#!pip install data-prep-connector" + ] + }, + { + "cell_type": "markdown", + "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "##### **** Configure the transform parameters. We will only show the use of double_precision. For a complete list, please refer to the Readme.md for this transform\n", + "##### \n", + "| parameter:type | Description |\n", + "| --- | --- |\n", + "| data_files_to_use: list | list of file extensions in the input folder to use for running the transform |\n", + "|pdf2parquet_double_precision: int | control precision |\n" + ] + }, + { + "cell_type": "markdown", + "id": "ebf1f782-0e61-485c-8670-81066beb734c", + "metadata": {}, + "source": [ + "##### ***** Import required Classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", + "metadata": {}, + "outputs": [], + "source": [ + "import ast\n", + "import os\n", + "import sys\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration\n" + ] + }, + { + "cell_type": "markdown", + "id": 
"7234563c-2924-4150-8a31-4aec98c1bf33", + "metadata": {}, + "source": [ + "##### ***** Setup runtime parameters for this transform" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e90a853e-412f-45d7-af3d-959e755aeebb", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# create parameters\n", + "input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n", + "output_folder = os.path.join( \"python\", \"output\")\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " \"data_files_to_use\": ast.literal_eval(\"['.pdf','.docx','.pptx','.zip']\"),\n", + " # execution info\n", + " \"runtime_pipeline_id\": \"pipeline_id\",\n", + " \"runtime_job_id\": \"job_id\",\n", + " # pdf2parquet params\n", + " \"pdf2parquet_double_precision\": 0,\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", + "metadata": {}, + "source": [ + "##### ***** Use python runtime to invoke the transform" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "0775e400-7469-49a6-8998-bd4772931459", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:23:55 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': , 'bitmap_area_threshold': 0.05, 'pdf_backend': , 'double_precision': 0}\n", + "13:23:55 INFO - pipeline id pipeline_id\n", + "13:23:55 INFO - code location None\n", + "13:23:55 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", + "13:23:55 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:23:55 INFO - data factory data_ Not using data sets, checkpointing 
False, max files -1, random samples -1, files to use ['.pdf', '.docx', '.pptx', '.zip'], files to checkpoint ['.parquet']\n", + "13:23:55 INFO - orchestrator pdf2parquet started at 2024-11-20 13:23:55\n", + "13:23:55 INFO - Number of files is 2, source profile {'max_file_size': 0.3013172149658203, 'min_file_size': 0.2757863998413086, 'total_file_size': 0.5771036148071289}\n", + "13:23:55 INFO - Initializing models\n", + "13:23:58 INFO - Processing archive_doc_filename='2305.03393v1-pg9.pdf' \n", + "13:23:59 INFO - Processing archive_doc_filename='2408.09869v1-pg1.pdf' \n", + "13:24:00 INFO - Completed 1 files (50.0%) in 0.029 min\n", + "13:24:03 INFO - Completed 2 files (100.0%) in 0.08 min\n", + "13:24:03 INFO - Done processing 2 files, waiting for flush() completion.\n", + "13:24:03 INFO - done flushing in 0.0 sec\n", + "13:24:03 INFO - Completed execution in 0.132 min, execution result 0\n" + ] + } + ], + "source": [ + "%%capture\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = PythonTransformLauncher(runtime_config=Pdf2ParquetPythonTransformConfiguration())\n", + "launcher.launch()\n" + ] + }, + { + "cell_type": "markdown", + "id": "c3df5adf-4717-4a03-864d-9151cd3f134b", + "metadata": {}, + "source": [ + "##### **** The specified folder will include the transformed parquet files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "7276fe84-6512-4605-ab65-747351e13a7c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['python/output/redp5110-ch1.parquet',\n", + " 'python/output/metadata.json',\n", + " 'python/output/archive1.parquet']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import glob\n", + "glob.glob(\"python/output/*\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fef6667e-71ed-4054-9382-55c6bb3fda70", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 0de3005f4fb4788be54c43f9c675346d221fa474 Mon Sep 17 00:00:00 2001 From: Maroun Touma Date: Wed, 20 Nov 2024 15:20:52 -0500 Subject: [PATCH 25/38] Added doc chunk minimal notebook Signed-off-by: Maroun Touma Signed-off-by: Pooja Holkar --- transforms/language/doc_chunk/doc_chunk.ipynb | 194 ++++++++++++++++++ 1 file changed, 194 insertions(+) create mode 100644 transforms/language/doc_chunk/doc_chunk.ipynb diff --git a/transforms/language/doc_chunk/doc_chunk.ipynb b/transforms/language/doc_chunk/doc_chunk.ipynb new file mode 100644 index 000000000..822d5b302 --- /dev/null +++ b/transforms/language/doc_chunk/doc_chunk.ipynb @@ -0,0 +1,194 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "afd55886-5f5b-4794-838e-ef8179fb0394", + "metadata": {}, + "source": [ + "##### **** These pip install need to be adapted to use the appropriate release level. 
Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", + "\n", + "##### **** example for transform developers working from git clone\n", + "```\n", + "make venv\n", + "source venv/bin/activate && pip install jupyterlab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "## This is here as a reference only\n", + "# Users and application developers must use the right tag for the latest from pypi\n", + "#!pip install data-prep-toolkit\n", + "#!pip install data-prep-toolkit-transforms\n", + "#!pip install data-prep-connector" + ] + }, + { + "cell_type": "markdown", + "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "##### **** Configure the transform parameters. We will only show the use of double_precision. For a complete list, please refer to the Readme.md for this transform\n", + "##### \n", + "| parameter:type | value | Description |\n", + "| --- | --- | --- |\n", + "|data_files_to_use: list | .parquet | Process all parquet files in the input folder |\n", + "| doc_chunk_chunking_type: str | dl_json | |\n" + ] + }, + { + "cell_type": "markdown", + "id": "ebf1f782-0e61-485c-8670-81066beb734c", + "metadata": {}, + "source": [ + "##### ***** Import required Classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", + "metadata": {}, + "outputs": [], + "source": [ + "import ast\n", + "import os\n", + "import sys\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "from doc_chunk_transform_python import DocChunkPythonTransformConfiguration\n" + ] + }, + { + "cell_type": "markdown", + "id": "7234563c-2924-4150-8a31-4aec98c1bf33", 
+ "metadata": {}, + "source": [ + "##### ***** Setup runtime parameters for this transform" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e90a853e-412f-45d7-af3d-959e755aeebb", + "metadata": {}, + "outputs": [], + "source": [ + "# create parameters\n", + "input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n", + "output_folder = os.path.join( \"python\", \"output\")\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " \"data_files_to_use\": ast.literal_eval(\"['.parquet']\"),\n", + " \"runtime_pipeline_id\": \"pipeline_id\",\n", + " \"runtime_job_id\": \"job_id\",\n", + " \"doc_chunk_chunking_type\": \"dl_json\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", + "metadata": {}, + "source": [ + "##### ***** Use python runtime to invoke the transform" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "0775e400-7469-49a6-8998-bd4772931459", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "15:19:48 INFO - pipeline id pipeline_id\n", + "15:19:48 INFO - code location None\n", + "15:19:48 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", + "15:19:48 INFO - data factory data_ max_files -1, n_sample -1\n", + "15:19:48 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "15:19:48 INFO - orchestrator doc_chunk started at 2024-11-20 15:19:48\n", + "15:19:48 INFO - Number of files is 1, source profile {'max_file_size': 0.011513710021972656, 'min_file_size': 0.011513710021972656, 'total_file_size': 0.011513710021972656}\n", + "15:19:48 INFO - Completed 1 files (100.0%) in 0.001 min\n", 
+ "15:19:48 INFO - Done processing 1 files, waiting for flush() completion.\n", + "15:19:48 INFO - done flushing in 0.0 sec\n", + "15:19:48 INFO - Completed execution in 0.001 min, execution result 0\n" + ] + } + ], + "source": [ + "%%capture\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = PythonTransformLauncher(runtime_config=DocChunkPythonTransformConfiguration())\n", + "launcher.launch()\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c3df5adf-4717-4a03-864d-9151cd3f134b", + "metadata": {}, + "source": [ + "##### **** The specified folder will include the transformed parquet files." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "7276fe84-6512-4605-ab65-747351e13a7c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['python/output/metadata.json', 'python/output/test1.parquet']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import glob\n", + "glob.glob(\"python/output/*\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 51c3b3f958a1e9201715da80b4747a9b5b856027 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Wed, 20 Nov 2024 11:14:05 -0800 Subject: [PATCH 26/38] Update pdf2parquet.ipynb Made a few changes to the comment cells that explain the execution of the immediate next cell Signed-off-by: Pooja Holkar --- transforms/language/pdf2parquet/pdf2parquet.ipynb | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/transforms/language/pdf2parquet/pdf2parquet.ipynb 
b/transforms/language/pdf2parquet/pdf2parquet.ipynb index 1ba814170..87e58c7b6 100644 --- a/transforms/language/pdf2parquet/pdf2parquet.ipynb +++ b/transforms/language/pdf2parquet/pdf2parquet.ipynb @@ -5,13 +5,13 @@ "id": "afd55886-5f5b-4794-838e-ef8179fb0394", "metadata": {}, "source": [ - "##### **** These pip install need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", - "\n", - "##### **** example for transform developers working from git clone\n", + "##### **** Example for transform developers working from git clone\n", "```\n", "make venv\n", "source venv/bin/activate && pip install jupyterlab\n", "```" + "##### **** The pip installs below need to be adapted to use the appropriate release level. Alternatively, the venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", + "\n", ] }, { @@ -36,7 +36,7 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "##### **** Configure the transform parameters. We will only show the use of double_precision. For a complete list, please refer to the Readme.md for this transform\n", + "##### **** Configure the transform parameters. We will only show the use of double_precision. 
For a complete list, please refer to the README.md for this transform\n", "##### \n", "| parameter:type | Description |\n", "| --- | --- |\n", @@ -49,7 +49,7 @@ "id": "ebf1f782-0e61-485c-8670-81066beb734c", "metadata": {}, "source": [ - "##### ***** Import required Classes and modules" + "##### ***** Import required classes and modules" ] }, { From fc3d13478283527a52bb375b61d6a90231c70dd6 Mon Sep 17 00:00:00 2001 From: Shahrokh Daijavad Date: Wed, 20 Nov 2024 11:27:39 -0800 Subject: [PATCH 27/38] Update pdf2parquet.ipynb Restored to a valid Notebook Signed-off-by: Pooja Holkar --- transforms/language/pdf2parquet/pdf2parquet.ipynb | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/transforms/language/pdf2parquet/pdf2parquet.ipynb b/transforms/language/pdf2parquet/pdf2parquet.ipynb index 87e58c7b6..1200e7a7f 100644 --- a/transforms/language/pdf2parquet/pdf2parquet.ipynb +++ b/transforms/language/pdf2parquet/pdf2parquet.ipynb @@ -5,13 +5,14 @@ "id": "afd55886-5f5b-4794-838e-ef8179fb0394", "metadata": {}, "source": [ - "##### **** Example for transform developers working from git clone\n", + "##### **** These pip install need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", + "\n", + "##### **** example: \n", "```\n", - "make venv\n", - "source venv/bin/activate && pip install jupyterlab\n", + "python -m venv && source venv/bin/activate\n", + "pip install -r requirements.txt\n", + "pip install jupyterlab\n", "```" - "##### **** The pip installs below need to be adapted to use the appropriate release level. 
Alternatively, the venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", - "\n", ] }, { From 0e5e1ea35b8af9b32c537ab7481f134df521d94c Mon Sep 17 00:00:00 2001 From: Maroun Touma Date: Wed, 20 Nov 2024 15:46:26 -0500 Subject: [PATCH 28/38] minimal sample notebook for how transform can be invoked Signed-off-by: Maroun Touma Signed-off-by: Pooja Holkar --- .../language/text_encoder/text_encoder.ipynb | 191 ++++++++++++++++++ 1 file changed, 191 insertions(+) create mode 100644 transforms/language/text_encoder/text_encoder.ipynb diff --git a/transforms/language/text_encoder/text_encoder.ipynb b/transforms/language/text_encoder/text_encoder.ipynb new file mode 100644 index 000000000..4adff9edf --- /dev/null +++ b/transforms/language/text_encoder/text_encoder.ipynb @@ -0,0 +1,191 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "afd55886-5f5b-4794-838e-ef8179fb0394", + "metadata": {}, + "source": [ + "##### **** These pip install need to be adapted to use the appropriate release level. 
Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", + "\n", + "##### **** example: \n", + "```\n", + "python -m venv && source venv/bin/activate\n", + "pip install -r requirements.txt\n", + "pip install jupyterlab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "## This is here as a reference only\n", + "# Users and application developers must use the right tag for the latest from pypi\n", + "#!pip install data-prep-toolkit\n", + "#!pip install data-prep-toolkit-transforms\n", + "#!pip install data-prep-connector" + ] + }, + { + "cell_type": "markdown", + "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "##### **** Configure the transform parameters. We will only show the use of double_precision. For a complete list, please refer to the README.md for this transform\n", + "##### \n", + "| parameter:type | Description |\n", + "| --- | --- |\n", + "| data_files_to_use: list | list of file extensions in the input folder to use for running the transform |\n", + "|pdf2parquet_double_precision: int | control precision |\n" + ] + }, + { + "cell_type": "markdown", + "id": "ebf1f782-0e61-485c-8670-81066beb734c", + "metadata": {}, + "source": [ + "##### ***** Import required classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import sys\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "from text_encoder_transform_python import TextEncoderPythonTransformConfiguration\n" + ] + }, + { + "cell_type": "markdown", + "id": 
"7234563c-2924-4150-8a31-4aec98c1bf33", + "metadata": {}, + "source": [ + "##### ***** Setup runtime parameters for this transform" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e90a853e-412f-45d7-af3d-959e755aeebb", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "input_folder = os.path.join (\"python\", \"test-data\", \"input\")\n", + "output_folder = os.path.join( \"python\", \"output\")\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " \"runtime_pipeline_id\": \"pipeline_id\",\n", + " \"runtime_job_id\": \"job_id\",\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", + "metadata": {}, + "source": [ + "##### ***** Use python runtime to invoke the transform" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "0775e400-7469-49a6-8998-bd4772931459", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "15:44:57 INFO - pipeline id pipeline_id\n", + "15:44:57 INFO - code location None\n", + "15:44:57 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", + "15:44:57 INFO - data factory data_ max_files -1, n_sample -1\n", + "15:44:57 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "15:44:57 INFO - orchestrator text_encoder started at 2024-11-20 15:44:57\n", + "15:44:57 INFO - Number of files is 1, source profile {'max_file_size': 0.0010089874267578125, 'min_file_size': 0.0010089874267578125, 'total_file_size': 0.0010089874267578125}\n", + "15:44:58 INFO - Completed 1 files (100.0%) in 0.003 min\n", + "15:44:58 INFO - Done processing 1 files, waiting for flush() completion.\n", + 
"15:44:58 INFO - done flushing in 0.0 sec\n", + "15:44:58 INFO - Completed execution in 0.017 min, execution result 0\n" + ] + } + ], + "source": [ + "%%capture\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = PythonTransformLauncher(runtime_config=TextEncoderPythonTransformConfiguration())\n", + "launcher.launch()\n" + ] + }, + { + "cell_type": "markdown", + "id": "c3df5adf-4717-4a03-864d-9151cd3f134b", + "metadata": {}, + "source": [ + "##### **** The specified folder will include the transformed parquet files." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "7276fe84-6512-4605-ab65-747351e13a7c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['python/output/metadata.json', 'python/output/test1.parquet']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import glob\n", + "glob.glob(\"python/output/*\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 7941a916729345973ead526a53fed78e475f5702 Mon Sep 17 00:00:00 2001 From: SHAHROKH DAIJAVAD Date: Wed, 20 Nov 2024 15:24:51 -0800 Subject: [PATCH 29/38] restoring the make venv Signed-off-by: SHAHROKH DAIJAVAD Signed-off-by: Pooja Holkar --- .../language/pdf2parquet/pdf2parquet.ipynb | 38 +++++++++---------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/transforms/language/pdf2parquet/pdf2parquet.ipynb b/transforms/language/pdf2parquet/pdf2parquet.ipynb index 1200e7a7f..e5548eb4c 100644 --- a/transforms/language/pdf2parquet/pdf2parquet.ipynb +++ 
b/transforms/language/pdf2parquet/pdf2parquet.ipynb @@ -9,8 +9,8 @@ "\n", "##### **** example: \n", "```\n", - "python -m venv && source venv/bin/activate\n", - "pip install -r requirements.txt\n", + "make venv \n", + "source venv/bin/activate \n", "pip install jupyterlab\n", "```" ] @@ -122,22 +122,22 @@ "name": "stderr", "output_type": "stream", "text": [ - "13:23:55 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': , 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': , 'bitmap_area_threshold': 0.05, 'pdf_backend': , 'double_precision': 0}\n", - "13:23:55 INFO - pipeline id pipeline_id\n", - "13:23:55 INFO - code location None\n", - "13:23:55 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", - "13:23:55 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:23:55 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.docx', '.pptx', '.zip'], files to checkpoint ['.parquet']\n", - "13:23:55 INFO - orchestrator pdf2parquet started at 2024-11-20 13:23:55\n", - "13:23:55 INFO - Number of files is 2, source profile {'max_file_size': 0.3013172149658203, 'min_file_size': 0.2757863998413086, 'total_file_size': 0.5771036148071289}\n", - "13:23:55 INFO - Initializing models\n", - "13:23:58 INFO - Processing archive_doc_filename='2305.03393v1-pg9.pdf' \n", - "13:23:59 INFO - Processing archive_doc_filename='2408.09869v1-pg1.pdf' \n", - "13:24:00 INFO - Completed 1 files (50.0%) in 0.029 min\n", - "13:24:03 INFO - Completed 2 files (100.0%) in 0.08 min\n", - "13:24:03 INFO - Done processing 2 files, waiting for flush() completion.\n", - "13:24:03 INFO - done flushing in 0.0 sec\n", - "13:24:03 INFO - Completed execution in 0.132 min, execution result 0\n" + "15:13:18 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': , 
'do_table_structure': True, 'do_ocr': True, 'ocr_engine': , 'bitmap_area_threshold': 0.05, 'pdf_backend': , 'double_precision': 0}\n", + "15:13:18 INFO - pipeline id pipeline_id\n", + "15:13:18 INFO - code location None\n", + "15:13:18 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", + "15:13:18 INFO - data factory data_ max_files -1, n_sample -1\n", + "15:13:18 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.docx', '.pptx', '.zip'], files to checkpoint ['.parquet']\n", + "15:13:18 INFO - orchestrator pdf2parquet started at 2024-11-20 15:13:18\n", + "15:13:18 INFO - Number of files is 2, source profile {'max_file_size': 0.3013172149658203, 'min_file_size': 0.2757863998413086, 'total_file_size': 0.5771036148071289}\n", + "15:13:18 INFO - Initializing models\n", + "15:14:08 INFO - Processing archive_doc_filename='2305.03393v1-pg9.pdf' \n", + "15:14:09 INFO - Processing archive_doc_filename='2408.09869v1-pg1.pdf' \n", + "15:14:10 INFO - Completed 1 files (50.0%) in 0.04 min\n", + "15:14:18 INFO - Completed 2 files (100.0%) in 0.179 min\n", + "15:14:18 INFO - Done processing 2 files, waiting for flush() completion.\n", + "15:14:18 INFO - done flushing in 0.0 sec\n", + "15:14:18 INFO - Completed execution in 1.007 min, execution result 0\n" ] } ], @@ -205,7 +205,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.10" + "version": "3.10.8" } }, "nbformat": 4, From 1eab380353227a7cc4c194bcbf2c173dcbbff81f Mon Sep 17 00:00:00 2001 From: SHAHROKH DAIJAVAD Date: Wed, 20 Nov 2024 15:58:01 -0800 Subject: [PATCH 30/38] unification of notebooks Signed-off-by: SHAHROKH DAIJAVAD Signed-off-by: Pooja Holkar --- transforms/language/doc_chunk/doc_chunk.ipynb | 6 ++---- transforms/language/pdf2parquet/pdf2parquet.ipynb | 9 ++++----- .../language/text_encoder/text_encoder.ipynb | 15 
++++----------- 3 files changed, 10 insertions(+), 20 deletions(-) diff --git a/transforms/language/doc_chunk/doc_chunk.ipynb b/transforms/language/doc_chunk/doc_chunk.ipynb index 822d5b302..3a8466037 100644 --- a/transforms/language/doc_chunk/doc_chunk.ipynb +++ b/transforms/language/doc_chunk/doc_chunk.ipynb @@ -5,9 +5,7 @@ "id": "afd55886-5f5b-4794-838e-ef8179fb0394", "metadata": {}, "source": [ - "##### **** These pip install need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", - "\n", - "##### **** example for transform developers working from git clone\n", + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n", "```\n", "make venv\n", "source venv/bin/activate && pip install jupyterlab\n", @@ -36,7 +34,7 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "##### **** Configure the transform parameters. We will only show the use of double_precision. For a complete list, please refer to the Readme.md for this transform\n", + "##### **** Configure the transform parameters. We will only show the use of data_files_to_use and doc_chunk_chunking_type. 
For a complete list of parameters, please refer to the README.md for this transform\n", "##### \n", "| parameter:type | value | Description |\n", "| --- | --- | --- |\n", diff --git a/transforms/language/pdf2parquet/pdf2parquet.ipynb b/transforms/language/pdf2parquet/pdf2parquet.ipynb index e5548eb4c..2d26741b3 100644 --- a/transforms/language/pdf2parquet/pdf2parquet.ipynb +++ b/transforms/language/pdf2parquet/pdf2parquet.ipynb @@ -5,9 +5,7 @@ "id": "afd55886-5f5b-4794-838e-ef8179fb0394", "metadata": {}, "source": [ - "##### **** These pip install need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", - "\n", - "##### **** example: \n", + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n", "```\n", "make venv \n", "source venv/bin/activate \n", @@ -37,12 +35,13 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "##### **** Configure the transform parameters. We will only show the use of double_precision. For a complete list, please refer to the README.md for this transform\n", + "##### **** Configure the transform parameters. We will only show the use of double_precision. For a complete list, please refer to the README.md for this transform.\n", "##### \n", "| parameter:type | Description |\n", "| --- | --- |\n", "| data_files_to_use: list | list of file extensions in the input folder to use for running the transform |\n", - "|pdf2parquet_double_precision: int | control precision |\n" + "|pdf2parquet_double_precision: int | If set, all floating points (e.g. bounding boxes) are rounded to this precision. 
For tests it is advised to use 0 |\n", + "\n" ] }, { diff --git a/transforms/language/text_encoder/text_encoder.ipynb b/transforms/language/text_encoder/text_encoder.ipynb index 4adff9edf..aca309594 100644 --- a/transforms/language/text_encoder/text_encoder.ipynb +++ b/transforms/language/text_encoder/text_encoder.ipynb @@ -5,12 +5,10 @@ "id": "afd55886-5f5b-4794-838e-ef8179fb0394", "metadata": {}, "source": [ - "##### **** These pip install need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", - "\n", - "##### **** example: \n", + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n", "```\n", - "python -m venv && source venv/bin/activate\n", - "pip install -r requirements.txt\n", + "make venv \n", + "source venv/bin/activate \n", "pip install jupyterlab\n", "```" ] @@ -37,12 +35,7 @@ "jp-MarkdownHeadingCollapsed": true }, "source": [ - "##### **** Configure the transform parameters. We will only show the use of double_precision. For a complete list, please refer to the README.md for this transform\n", - "##### \n", - "| parameter:type | Description |\n", - "| --- | --- |\n", - "| data_files_to_use: list | list of file extensions in the input folder to use for running the transform |\n", - "|pdf2parquet_double_precision: int | control precision |\n" + "##### **** Configure the transform parameters. For this notebook, we use all the default parameters. 
For a complete list of parameters, please refer to the README.md for this transform.\n" ] }, { From b7d34ce5dc8e275591d39438a1d473120c0c3240 Mon Sep 17 00:00:00 2001 From: Maroun Touma Date: Fri, 22 Nov 2024 18:12:50 -0500 Subject: [PATCH 31/38] added constraint for pydantic to prevent llama-index-core from picking up 2.10 Signed-off-by: Maroun Touma Signed-off-by: Pooja Holkar --- transforms/language/doc_chunk/python/requirements.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/transforms/language/doc_chunk/python/requirements.txt b/transforms/language/doc_chunk/python/requirements.txt index dd076d0e0..c24d0c3e2 100644 --- a/transforms/language/doc_chunk/python/requirements.txt +++ b/transforms/language/doc_chunk/python/requirements.txt @@ -1,3 +1,4 @@ data-prep-toolkit==0.2.2.dev2 docling-core==2.3.0 +pydantic>=2.0.0,<2.10.0 llama-index-core>=0.11.22,<0.12.0 From b401b707e6fa669e7726a5d8c19014458327e1a7 Mon Sep 17 00:00:00 2001 From: Sungeun An Date: Tue, 19 Nov 2024 00:17:27 -0800 Subject: [PATCH 32/38] updated README file and added a sample notebook Signed-off-by: Sungeun An Signed-off-by: Pooja Holkar --- .../html2parquet/notebooks/html2parquet.ipynb | 220 ++++++++++++++++++ .../language/html2parquet/python/README.md | 103 +++++++- 2 files changed, 315 insertions(+), 8 deletions(-) create mode 100644 transforms/language/html2parquet/notebooks/html2parquet.ipynb diff --git a/transforms/language/html2parquet/notebooks/html2parquet.ipynb b/transforms/language/html2parquet/notebooks/html2parquet.ipynb new file mode 100644 index 000000000..c2713899d --- /dev/null +++ b/transforms/language/html2parquet/notebooks/html2parquet.ipynb @@ -0,0 +1,220 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "c4f9c952-cb3b-40f1-bfb5-00d9a43a5715", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install data-prep-toolkit==0.2.2.dev2\n", + "!pip install 'data-prep-toolkit-transforms[html2parquet]==0.2.2.dev2'\n", + "!pip 
install pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "20663a67-5aa1-4b61-b989-94201613e41f", + "metadata": {}, + "outputs": [], + "source": [ + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "\n", + "from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e75f6922-eb0f-4164-a536-f96393e04604", + "metadata": {}, + "outputs": [], + "source": [ + "import ast\n", + "\n", + "# create parameters\n", + "local_conf = {\n", + " \"input_folder\": \"input\",\n", + " \"output_folder\": \"output\",\n", + "}\n", + "\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " \"data_files_to_use\": ast.literal_eval(\"['.html']\"),\n", + "}\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "4d2354db-1bb3-4a71-98df-f0f148af3a02", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "17:09:40 INFO - html2parquet parameters are : {'output_format': , 'favor_precision': , 'favor_recall': }\n", + "17:09:40 INFO - pipeline id pipeline_id\n", + "17:09:40 INFO - code location None\n", + "17:09:40 INFO - data factory data_ is using local data access: input_folder - input output_folder - output\n", + "17:09:40 INFO - data factory data_ max_files -1, n_sample -1\n", + "17:09:40 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.html'], files to checkpoint ['.parquet']\n", + "17:09:40 INFO - orchestrator html2parquet started at 2024-11-13 17:09:40\n", + "17:09:40 INFO - Number of files is 1, source profile {'max_file_size': 0.2035503387451172, 'min_file_size': 0.2035503387451172, 'total_file_size': 0.2035503387451172}\n", + "17:09:47 INFO - Completed 1 
files (100.0%) in 0.111 min\n", + "17:09:47 INFO - Done processing 1 files, waiting for flush() completion.\n", + "17:09:47 INFO - done flushing in 0.0 sec\n", + "17:09:47 INFO - Completed execution in 0.111 min, execution result 0\n" + ] + } + ], + "source": [ + "\n", + "import sys\n", + "sys.argv = ParamsUtils.dict_to_req(d=(params))\n", + "# create launcher\n", + "launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration())\n", + "# launch\n", + "return_code = launcher.launch()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "e2bee8da-c566-4e45-bca1-354dfd04b0df", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
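<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\">\n", + "      <th></th>\n", + "      <th>title</th>\n", + "      <th>document</th>\n", + "      <th>contents</th>\n", + "      <th>document_id</th>\n", + "      <th>size</th>\n", + "      <th>date_acquired</th>\n", + "    </tr>\n", + "  </thead>\n", + "  <tbody>\n", + "    <tr>\n", + "      <th>0</th>\n", + "      <td>ai-alliance-index.html</td>\n", + "      <td>ai-alliance-index.html</td>\n", + "      <td>![](https://images.prismic.io/ai-alliance/Ztf3...</td>\n", + "      <td>f86b8cebe07ec9f43a351bb4dc897f162f5a88cbb0f121...</td>\n", + "      <td>394</td>\n", + "      <td>2024-11-13T17:09:40.947095</td>\n", + "    </tr>\n", + "  </tbody>\n", + "</table>\n", + "</div>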
" + ], + "text/plain": [ + " title document \\\n", + "0 ai-alliance-index.html ai-alliance-index.html \n", + "\n", + " contents \\\n", + "0 ![](https://images.prismic.io/ai-alliance/Ztf3... \n", + "\n", + " document_id size \\\n", + "0 f86b8cebe07ec9f43a351bb4dc897f162f5a88cbb0f121... 394 \n", + "\n", + " date_acquired \n", + "0 2024-11-13T17:09:40.947095 " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pyarrow.parquet as pq\n", + "import pandas as pd\n", + "table = pq.read_table('output/ai-alliance-index.parquet')\n", + "table.to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "cde6e37d-c437-490f-8e01-f4f51a123484", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'![](https://images.prismic.io/ai-alliance/Ztf3gLzzk9ZrW8v8_caliopensourceslide.jpg?auto=format%2Ccompress&fit=max&w=3840)\\n\\n## Open Source AI Demo Night\\n\\nThe AI Alliance, in collaboration with Cerebral Valley and Ollama, hosted Open Source AI Demo Night in San Francisco, bringing together more than 200+ developers and innovators to showcase and celebrate the latest advances in open-source AI.'" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "table.to_pandas()['contents'][0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2fd0d13b-1ff6-4988-91fb-52c25ba998c8", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "587e43ee-7b51-4a9c-8bf2-0a23e309a7ae", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + 
"pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/transforms/language/html2parquet/python/README.md b/transforms/language/html2parquet/python/README.md index 0d25553e1..6b12bffea 100644 --- a/transforms/language/html2parquet/python/README.md +++ b/transforms/language/html2parquet/python/README.md @@ -1,25 +1,55 @@ -# html2parquet Transform +# HTML to Parquet Transform -This tranforms iterate through zip of HTML files or single HTML files and generates parquet files containing the converted document in string. +--- -The HTML conversion is using the [Trafilatura](https://trafilatura.readthedocs.io/en/latest/usage-python.html). +## Description -## Output format +This transform iterates through zipped collections of HTML files or single HTML files and generates Parquet files containing the extracted content, leveraging the [Trafilatura library](https://trafilatura.readthedocs.io/en/latest/usage-python.html) for extraction of text, tables, images, and other components. -The output format will contain the following colums +--- + +## Contributors + +- Sungeun An (sungeun.an@ibm.com) +- Syed Zawad (szawad@ibm.com) + +--- + +## Date + +**Last updated:** 10/16/24 +- **Update details:** + - Added Trafilatura parameters (`favor_precision` and `favor_recall`) for enhanced control over content extraction. + - Enhanced table and image extraction features. + - See [Pull Request #707](https://github.com/IBM/data-prep-kit/pull/707) for more details. + +--- + +## Input and Output + +### Input +- Accepted Formats: Single HTML files or zipped collections of HTML files. 
+- Sample Input Files: [sample html files](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/python/test-data/input) + +### Output +- Format: Parquet files with the following structure: ```jsonc { - "title": "string", // the member filename - "document": "string", // the base of the source archive - "contents": "string", // the content of the HTML + "title": "string", // the member filename + "document": "string", // the base of the source archive + "contents": "string", // the content of the HTML "document_id": "string", // the document id, a hash of `contents` "size": "string", // the size of `contents` "date_acquired": "date", // the date when the transform was executing } ``` + + ## Parameters +### User-Configurable Parameters + The table below provides the parameters that users can adjust to control the behavior of the extraction: | Parameter | Default | Description | @@ -28,6 +58,8 @@ The table below provides the parameters that users can adjust to control the beh | `favor_precision` | `True` | Prefers less content but more accurate extraction. Options: `True`, `False`. | | `favor_recall` | `True` | Extracts more content when uncertain. Options: `True`, `False`. | +### Default Parameters + The table below provides the parameters that are enabled by default to ensure a comprehensive extraction process: | Parameter | Default | Description | @@ -43,6 +75,7 @@ The table below provides the parameters that are enabled by default to ensure a - To prioritize extracting more content over accuracy, set `favor_recall=True` and `favor_precision=False`. - When invoking the CLI, use the following syntax for these parameters: `--html2parquet_`. For example: `--html2parquet_output_format='markdown'`. 
+ ## Example ### Sample HTML @@ -155,3 +188,57 @@ Chicago | ## Contact Us ``` +## Usage + +### Command-Line Interface (CLI) + +Run the transform with the following command: + +``` +python ../html2parquet/python/src/html2parquet_transform_python.py \ + --data_local_config "{'input_folder': '../html2parquet/python/test-data/input', 'output_folder': '../html2parquet/python/test-data/expected'}" \ + --data_files_to_use '[".html", ".zip"]' +``` + +- When invoking the CLI, use the following syntax for these parameters: `--html2parquet_`. For example: `--html2parquet_output_format='markdown'`. + +### Python Code + +To run the transform programmatically: + +``` +from data_processing.runtime.pure_python import PythonTransformLauncher +from data_processing.utils import ParamsUtils + +from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration +import ast +import sys + +# create parameters +local_conf = { + "input_folder": "input", + "output_folder": "output", +} + +params = { + # Data access. Only required parameters are specified + "data_local_config": ParamsUtils.convert_to_ast(local_conf), + "data_files_to_use": ast.literal_eval("['.html']"), +} + +sys.argv = ParamsUtils.dict_to_req(d=(params)) +# create launcher +launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration()) +# launch +return_code = launcher.launch() + +``` + +### Sample Notebook + +See the [sample notebook](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/notebooks/html2parquet.ipynb) for an example. + + +## Further Resources + +- [Trafilatura](https://trafilatura.readthedocs.io/en/latest/usage-python.html). 
From 8fe07f4bc56b8ae826cb56574905f80d7bcb9ae1 Mon Sep 17 00:00:00 2001 From: Sungeun An Date: Tue, 19 Nov 2024 09:36:37 -0800 Subject: [PATCH 33/38] removed python code in README and minor changes in the notebook Signed-off-by: Sungeun An Signed-off-by: Pooja Holkar --- .../html2parquet/notebooks/html2parquet.ipynb | 8 ++--- .../language/html2parquet/python/README.md | 31 ------------------- 2 files changed, 4 insertions(+), 35 deletions(-) diff --git a/transforms/language/html2parquet/notebooks/html2parquet.ipynb b/transforms/language/html2parquet/notebooks/html2parquet.ipynb index c2713899d..230805144 100644 --- a/transforms/language/html2parquet/notebooks/html2parquet.ipynb +++ b/transforms/language/html2parquet/notebooks/html2parquet.ipynb @@ -37,14 +37,14 @@ "\n", "# create parameters\n", "local_conf = {\n", - " \"input_folder\": \"input\",\n", - " \"output_folder\": \"output\",\n", + " \"input_folder\": \"/path/to/your/input/folder\",\n", + " \"output_folder\": \"/path/to/your/output/folder\",\n", "}\n", "\n", "params = {\n", " # Data access. 
Only required parameters are specified\n", " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", - " \"data_files_to_use\": ast.literal_eval(\"['.html']\"),\n", + " \"data_files_to_use\": ast.literal_eval(\"['.zip', '.html']\"),\n", "}\n" ] }, @@ -154,7 +154,7 @@ "source": [ "import pyarrow.parquet as pq\n", "import pandas as pd\n", - "table = pq.read_table('output/ai-alliance-index.parquet')\n", + "table = pq.read_table('/path/to/your/output/folder/sample.parquet')\n", "table.to_pandas()" ] }, diff --git a/transforms/language/html2parquet/python/README.md b/transforms/language/html2parquet/python/README.md index 6b12bffea..eadd082fb 100644 --- a/transforms/language/html2parquet/python/README.md +++ b/transforms/language/html2parquet/python/README.md @@ -202,37 +202,6 @@ python ../html2parquet/python/src/html2parquet_transform_python.py \ - When invoking the CLI, use the following syntax for these parameters: `--html2parquet_`. For example: `--html2parquet_output_format='markdown'`. -### Python Code - -To run the transform programmatically: - -``` -from data_processing.runtime.pure_python import PythonTransformLauncher -from data_processing.utils import ParamsUtils - -from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration -import ast -import sys - -# create parameters -local_conf = { - "input_folder": "input", - "output_folder": "output", -} - -params = { - # Data access. 
Only required parameters are specified - "data_local_config": ParamsUtils.convert_to_ast(local_conf), - "data_files_to_use": ast.literal_eval("['.html']"), -} - -sys.argv = ParamsUtils.dict_to_req(d=(params)) -# create launcher -launcher = PythonTransformLauncher(Html2ParquetPythonTransformConfiguration()) -# launch -return_code = launcher.launch() - -``` ### Sample Notebook From 51f3f787b4bbcaac7b1f010ecadeea6c0f36310e Mon Sep 17 00:00:00 2001 From: Sungeun An Date: Thu, 21 Nov 2024 12:52:23 -0800 Subject: [PATCH 34/38] updated with relative path and added markdown for notebook Signed-off-by: Sungeun An Signed-off-by: Pooja Holkar --- .../html2parquet/notebooks/html2parquet.ipynb | 57 ++++++++++++------- .../language/html2parquet/python/README.md | 10 ++-- 2 files changed, 40 insertions(+), 27 deletions(-) diff --git a/transforms/language/html2parquet/notebooks/html2parquet.ipynb b/transforms/language/html2parquet/notebooks/html2parquet.ipynb index 230805144..669a4d30d 100644 --- a/transforms/language/html2parquet/notebooks/html2parquet.ipynb +++ b/transforms/language/html2parquet/notebooks/html2parquet.ipynb @@ -1,9 +1,17 @@ { "cells": [ + { + "cell_type": "markdown", + "id": "8435e1f7-0c2e-49f4-a77a-b525ee6c532b", + "metadata": {}, + "source": [ + "# Html2Parquet Transform Sample Notebook" + ] + }, { "cell_type": "code", - "execution_count": 1, - "id": "c4f9c952-cb3b-40f1-bfb5-00d9a43a5715", + "execution_count": null, + "id": "d9420989-ec8a-4fde-9a93-dc25096389f1", "metadata": {}, "outputs": [], "source": [ @@ -26,6 +34,14 @@ "from html2parquet_transform_python import Html2ParquetPythonTransformConfiguration\n" ] }, + { + "cell_type": "markdown", + "id": "6d85491b-0093-46e7-8653-ca8052ea59f0", + "metadata": {}, + "source": [ + "## Specify input/output folders and parameters" + ] + }, { "cell_type": "code", "execution_count": 3, @@ -37,7 +53,7 @@ "\n", "# create parameters\n", "local_conf = {\n", - " \"input_folder\": \"/path/to/your/input/folder\",\n", + " 
\"input_folder\": \"/path/to/your/input/folder\", # For the sample input files, refer to the 'python/test-data/input' folder\n", " \"output_folder\": \"/path/to/your/output/folder\",\n", "}\n", "\n", @@ -48,6 +64,14 @@ "}\n" ] }, + { + "cell_type": "markdown", + "id": "0dcd1249-1eb8-4b33-9827-626f90c840b4", + "metadata": {}, + "source": [ + "## Invoke the html2parquet transformation" + ] + }, { "cell_type": "code", "execution_count": 4, @@ -74,7 +98,6 @@ } ], "source": [ - "\n", "import sys\n", "sys.argv = ParamsUtils.dict_to_req(d=(params))\n", "# create launcher\n", @@ -83,6 +106,14 @@ "return_code = launcher.launch()\n" ] }, + { + "cell_type": "markdown", + "id": "3c66468d-703f-427f-a1dd-a758edd334de", + "metadata": {}, + "source": [ + "## Checking the output Parquet file" + ] + }, { "cell_type": "code", "execution_count": 5, @@ -178,22 +209,6 @@ "source": [ "table.to_pandas()['contents'][0]" ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2fd0d13b-1ff6-4988-91fb-52c25ba998c8", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "587e43ee-7b51-4a9c-8bf2-0a23e309a7ae", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { @@ -212,7 +227,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.10" + "version": "3.11.9" } }, "nbformat": 4, diff --git a/transforms/language/html2parquet/python/README.md b/transforms/language/html2parquet/python/README.md index eadd082fb..35e781007 100644 --- a/transforms/language/html2parquet/python/README.md +++ b/transforms/language/html2parquet/python/README.md @@ -18,10 +18,7 @@ This transform iterates through zipped collections of HTML files or single HTML ## Date **Last updated:** 10/16/24 -- **Update details:** - - Added Trafilatura parameters (`favor_precision` and `favor_recall`) for enhanced control over content extraction. - - Enhanced table and image extraction features. 
- - See [Pull Request #707](https://github.com/IBM/data-prep-kit/pull/707) for more details. +**Update details:** Enhanced table and image extraction features by adding the corresponding Trafilatura parameters. --- @@ -29,7 +26,7 @@ This transform iterates through zipped collections of HTML files or single HTML ### Input - Accepted Formats: Single HTML files or zipped collections of HTML files. -- Sample Input Files: [sample html files](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/python/test-data/input) +- Sample Input Files: [sample html files](test-data/input) ### Output - Format: Parquet files with the following structure: @@ -205,7 +202,8 @@ python ../html2parquet/python/src/html2parquet_transform_python.py \ ### Sample Notebook -See the [sample notebook](https://github.com/IBM/data-prep-kit/tree/dev/transforms/language/html2parquet/notebooks/html2parquet.ipynb) for an example. +See the [sample notebook](../notebooks/html2parquet.ipynb) +) for an example. ## Further Resources From 5746aee432398e64eee0dd368a9c6f1bb3b099c3 Mon Sep 17 00:00:00 2001 From: SHAHROKH DAIJAVAD Date: Fri, 22 Nov 2024 09:15:56 -0800 Subject: [PATCH 35/38] Update web2parquet.ipynb Signed-off-by: Pooja Holkar --- transforms/universal/web2parquet/web2parquet.ipynb | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/transforms/universal/web2parquet/web2parquet.ipynb b/transforms/universal/web2parquet/web2parquet.ipynb index 2bd55f0bc..ea802d734 100644 --- a/transforms/universal/web2parquet/web2parquet.ipynb +++ b/transforms/universal/web2parquet/web2parquet.ipynb @@ -5,12 +5,12 @@ "id": "afd55886-5f5b-4794-838e-ef8179fb0394", "metadata": {}, "source": [ - "##### **** These pip install need to be adapted to use the appropriate release level. 
Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release\n", + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running the jupyter lab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from git clone:\n", + "##### \n", "\n", - "##### **** example: \n", "```\n", - "python -m venv && source venv/bin/activate\n", - "pip install -r requirements.txt\n", + "make venv \n", + "source venv/bin/activate \n", "pip install jupyterlab\n", "```" ] From fc844704045561b415626ea037ffc7e8c546f31f Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Mon, 25 Nov 2024 20:28:07 +0530 Subject: [PATCH 36/38] Update Run_your_first_PII_redactor_transform.ipynb Signed-off-by: Pooja Holkar --- .../notebooks/PII/Run_your_first_PII_redactor_transform.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb index c671ca139..76bde23c2 100644 --- a/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb +++ b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb @@ -475,7 +475,7 @@ "
\n", "
\n", "\n", - "### This notebook effectively demonstrates how to seamlessly apply redaction for PII entities." + "### This notebook effectively demonstrates how to seamlessly apply redaction for PII entities" ] >>>>>>> 25a65d81 (notebook recipe for PII redaction code) } From 823839596e86dc109e1d14fdb8fbb60df25590a6 Mon Sep 17 00:00:00 2001 From: Pooja Holkar Date: Wed, 27 Nov 2024 12:36:58 +0530 Subject: [PATCH 37/38] updated code Signed-off-by: Pooja Holkar --- ...un_your_first_PII_redactor_transform.ipynb | 159 ------------------ 1 file changed, 159 deletions(-) diff --git a/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb index 76bde23c2..7fc1d964d 100644 --- a/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb +++ b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb @@ -4,21 +4,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ -<<<<<<< HEAD - "Extracting Text from PDF and Configuring PII Redactor" -======= "## Extracting Text from PDF and Configuring PII Redactor" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { "cell_type": "markdown", "metadata": {}, "source": [ -<<<<<<< HEAD - "What is a PII Redactor?\n", - "A PII (Personally Identifiable Information) Redactor is a tool or system designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n", -======= "\n", "**Author**: Pooja Holkar ,\n", "**email**:poholkar@in.ibm.com\n", @@ -29,23 +21,10 @@ "### What is a PII Redactor?\n", "\n", "A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. 
PII includes details that can be used to identify an individual, such as:\n", ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "\n", "Names\n", "Email addresses\n", "Phone numbers\n", -<<<<<<< HEAD - "Physical or shipping addresses\n", - "Financial details (e.g., credit card numbers)\n", - "Use Case in This Project\n", - "In this project, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.\n", - "\n", - "Workflow Overview\n", - "Text Extraction:\n", - "\n", - "The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n", - "Redactor Configuration:\n", -======= "Addresses\n", "Financial details (e.g., credit card numbers)\n", "\n", @@ -57,30 +36,19 @@ "The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n", "\n", " **Redactor Configuration**\n", ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "\n", "The system is configured to recognize specific PII entities relevant to invoices, such as:\n", "Customer names\n", "Email addresses\n", "Phone numbers\n", "Shipping addresses\n", -<<<<<<< HEAD - "PII Detection and Redaction:\n", -======= "\n", " **PII Detection and Redaction**\n", ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "\n", "The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.\n", "Output:\n", "\n", "The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.\n", -<<<<<<< HEAD - "Why is PII Redaction Important?\n", - "Data Privacy Compliance: Adheres to regulations like GDPR, HIPAA, or CCPA that mandate safeguarding customer information.\n", - "Risk Mitigation: Prevents unauthorized access to or misuse of sensitive data.\n", - "Automation Benefits: Simplifies and accelerates the process of securing information in large-scale document 
handling.\n" -======= "\n", "### Why is PII Redaction Important?\n", "\n", @@ -96,14 +64,10 @@ "metadata": {}, "source": [ "### Pre-req: Install data-prep-kit dependencies" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { "cell_type": "code", -<<<<<<< HEAD - "execution_count": 8, -======= "execution_count": 1, "metadata": {}, "outputs": [], @@ -118,16 +82,10 @@ { "cell_type": "code", "execution_count": 2, ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [], "source": [ "import pdfplumber\n", -<<<<<<< HEAD - "#from data_processing.transform.table_transform import AbstractTableTransform\n", - "#from data_processing.transform import AbstractTableTransform, TransformConfiguration\n", -======= ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "from pii_redactor_transform import PIIRedactorTransform\n" ] }, @@ -135,69 +93,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ -<<<<<<< HEAD - "Step 1: Extract Text from PDF" -======= "### Step 1: Inspect the Data \n", "\n", "We will use simple invoice PDF\n", "\n", "[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) - ] - }, - { - "cell_type": "code", -<<<<<<< HEAD - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "#pdf_path = \"/Users/poojaholkar/GSI/WATSONX/WATSONXDATA/DPK/data-prep-kit-dev/invoicedata/invoice_garminwatch.pdf\" # Replace with the path to your uploaded PDF\n", - "pdf_path=\"/Users/poojaholkar/Downloads/Invoice.pdf\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "ename": "SyntaxError", - "evalue": "invalid syntax (2155885561.py, line 3)", - "output_type": "error", - "traceback": [ - "\u001b[0;36m Cell \u001b[0;32mIn[8], line 3\u001b[0;36m\u001b[0m\n\u001b[0;31m pip install 
presidio_analyzer\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" - ] - } - ], - "source": [ - "#pip install flair\n", - "#pip install spacy\n", - "#pip install presidio_anonymizer==2.2.355\n", - "#pip install numpy==1.26.4" ] }, { "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Found existing installation: numpy 1.26.4\n", - "Uninstalling numpy-1.26.4:\n", - " Successfully uninstalled numpy-1.26.4\n" - ] - } - ], - "source": [ - "!pip uninstall numpy --yes\n", - "#!pip install numpy==1.19.3\n" -======= "execution_count": 4, "metadata": {}, "outputs": [ @@ -220,29 +124,20 @@ "outputs": [], "source": [ "pdf_path=\"Invoice.pdf\"" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { "cell_type": "markdown", "metadata": {}, "source": [ -<<<<<<< HEAD - "Step 1: Extract Text from PDF\n", -======= "### Step 2: Extract Text from PDF\n", ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "\n", "This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string." 
] }, { "cell_type": "code", -<<<<<<< HEAD - "execution_count": 11, -======= "execution_count": 13, ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [], "source": [ @@ -255,11 +150,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ -<<<<<<< HEAD - "#Step 2: Configure the PII Redactor\n" -======= "### Step 3: Configure the PII Redactor\n" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { @@ -272,11 +163,7 @@ }, { "cell_type": "code", -<<<<<<< HEAD - "execution_count": 12, -======= "execution_count": 14, ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [], "source": [ @@ -293,11 +180,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ -<<<<<<< HEAD - "Step 3: Initialize and Run the PII Redactor\n" -======= "### Step 4: Initialize and Run the PII Redactor\n" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { @@ -309,16 +192,6 @@ }, { "cell_type": "code", -<<<<<<< HEAD - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "20:33:16 INFO - Loading model from flair/ner-english-large\n" -======= "execution_count": 15, "metadata": {}, "outputs": [ @@ -338,16 +211,12 @@ "output_type": "stream", "text": [ "17:45:46 INFO - Loading model from flair/ner-english-large\n" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { "name": "stdout", "output_type": "stream", "text": [ -<<<<<<< HEAD - "2024-11-24 20:33:33,105 SequenceTagger predicts: Dictionary with 20 tags: , O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, , \n" -======= "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", "You can now load the package via spacy.load('en_core_web_sm')\n", "\u001b[38;5;3m⚠ Restart to reload dependencies\u001b[0m\n", @@ -355,7 +224,6 @@ "order to load all the package's dependencies. 
You can do this by selecting the\n", "'Restart kernel' or 'Restart runtime' option.\n", "2024-11-25 17:46:04,004 SequenceTagger predicts: Dictionary with 20 tags: , O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, , \n" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] } ], @@ -368,11 +236,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ -<<<<<<< HEAD - "Step 4: Apply the Redactor to Text Data\n" -======= "### Step 5: Apply the Redactor to Text Data\n" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { @@ -384,11 +248,7 @@ }, { "cell_type": "code", -<<<<<<< HEAD - "execution_count": 14, -======= "execution_count": 16, ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [], "source": [ @@ -401,11 +261,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ -<<<<<<< HEAD - "Step 5: Display the Redaction Results\n" -======= "### Step 6: Display the Redaction Results\n" ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) ] }, { @@ -417,11 +273,7 @@ }, { "cell_type": "code", -<<<<<<< HEAD - "execution_count": 15, -======= "execution_count": 17, ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "metadata": {}, "outputs": [ { @@ -465,8 +317,6 @@ "print(\"Redacted Text:\\n\", redacted_text)\n", "print(\"Detected Entities:\\n\", detected_entities)" ] -<<<<<<< HEAD -======= }, { "cell_type": "markdown", @@ -477,16 +327,11 @@ "\n", "### This notebook effectively demonstrates how to seamlessly apply redaction for PII entities" ] ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) } ], "metadata": { "kernelspec": { -<<<<<<< HEAD - "display_name": "data-prep-kit-1", -======= "display_name": "Python 3 (ipykernel)", ->>>>>>> 25a65d81 (notebook recipe for PII redaction code) "language": "python", "name": "python3" }, @@ -504,9 +349,5 @@ } }, "nbformat": 4, -<<<<<<< HEAD - "nbformat_minor": 2 -======= "nbformat_minor": 4 
->>>>>>> 25a65d81 (notebook recipe for PII redaction code) } From 7282f1af0d66a68712c9cb47975df38485cfd3a6 Mon Sep 17 00:00:00 2001 From: pooja holkar <37286638+PoojaHolkar@users.noreply.github.com> Date: Wed, 27 Nov 2024 12:38:12 +0530 Subject: [PATCH 38/38] Delete examples/notebooks/PII/test Signed-off-by: Pooja Holkar --- examples/notebooks/PII/test | 1 - 1 file changed, 1 deletion(-) delete mode 100644 examples/notebooks/PII/test diff --git a/examples/notebooks/PII/test b/examples/notebooks/PII/test deleted file mode 100644 index 8b1378917..000000000 --- a/examples/notebooks/PII/test +++ /dev/null @@ -1 +0,0 @@ -