pdf2pdfocr

A tool that runs OCR on a PDF (or a supported image) and adds a text "layer" to the original content, producing a searchable PDF (a "PDF sandwich"). The script uses only open source tools.

donations

This software is free, but if you like it, please donate to support new features.

Bitcoin (BTC) address: 173D1zQQyzvCCCek9b1SpDvh7JikBEdtRJ

tips

Tips are also welcome!

Dogecoin (DOGE) address: D94hD2qPnkxmZk8qa1b6F1d7NfUrPkmcrG

PIX (Brazilian Instant Payments): 0726e8f2-7e59-488a-8abb-bda8f0d7d9ce

Please get in touch for donations and tips in other cryptocurrencies.

installation

On Linux, installation is straightforward: just install the required packages and be happy. You can use the "install_command" script to copy the required files to "/usr/local/bin".

On macOS, you will need MacPorts.

# First install Xcode from Mac App Store, then:
xcode-select --install
sudo xcodebuild -license
# Install MacPorts from https://www.macports.org/install.php
sudo port selfupdate
# Install tesseract as the main OCR engine (Portuguese included below - please add your preferred languages)
sudo port install git libtool automake autoconf tesseract tesseract-por tesseract-osd tesseract-eng
# Install cuneiform (the optional OCR engine - see flag "-c")
sudo port install cuneiform
# Install qpdf (optional for better performance)
sudo port install qpdf
# Install python 3 and other dependencies
sudo port install python39 py39-pip poppler poppler-data ImageMagick ghostscript
# Configure default python3 installer
sudo port select --set python python39
sudo port select --set python3 python39
sudo port select --set pip pip39
sudo port select --set pip3 pip39
# Configure venv and python deps in fixed home directory
python3 -m venv ~/pdf2pdfocr-venv
~/pdf2pdfocr-venv/bin/python3 -m pip install --upgrade pip
~/pdf2pdfocr-venv/bin/pip3 install --upgrade setuptools
~/pdf2pdfocr-venv/bin/pip3 install -r requirements.txt
~/pdf2pdfocr-venv/bin/pip3 install -r requirements_gui.txt
# Copy main scripts to venv
cp pdf2pdfocr.py pdf2pdfocr_gui.py pdf2pdfocr_multibackground.py ~/pdf2pdfocr-venv/bin
sudo ./install_command

Cuneiform and qpdf are optional.
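
After installing, it can help to confirm that the external programs are reachable on the PATH before the first run. The sketch below is not part of pdf2pdfocr: the function name "missing_tools" is hypothetical, and the executable names in the example call are assumptions based on what the packages above usually ship ("tesseract", "gs" from ghostscript, "pdftoppm" from poppler, "qpdf").

```shell
#!/bin/sh
# Hypothetical helper, not part of pdf2pdfocr: print every command name
# given as an argument that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example call (the executable names are assumptions; adjust to your setup):
# missing_tools tesseract gs pdftoppm qpdf
```

An empty result means every listed command was found. Since cuneiform and qpdf are optional, a missing "qpdf" or "cuneiform" is not necessarily a problem.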

On Windows, you will need to install the required software manually. Please read the "install_windows.txt" file and follow the tutorial that uses the scoop tool. It's easy! :-)

docker (without GUI)

The Dockerfile can be used to build a Docker image that runs pdf2pdfocr inside a container. To build the image, download all the sources and run:

docker build -t leofcardoso/pdf2pdfocr:latest .

It's also possible to pull the Docker image from Docker Hub.

docker pull leofcardoso/pdf2pdfocr

You can run the application with docker run.

docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./sample_file.pdf
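
The command above mounts the current directory, so the input file has to live there. A minimal sketch of a wrapper that mounts the input file's own directory instead is shown below. The function name "ocr_in_docker" and the "DRY_RUN" variable are hypothetical and not part of pdf2pdfocr (the repository ships its own "docker-wrapper.sh", which may work differently). The sketch also assumes that relative paths inside the container resolve against "/home/docker", as the example above suggests.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of pdf2pdfocr: OCR one file that may live
# outside the current directory by mounting the file's own directory.
# Set DRY_RUN=1 to print the docker command instead of running it.
ocr_in_docker() {
    input="$1"
    # Resolve the directory that contains the input file to an absolute path.
    dir="$(cd "$(dirname "$input")" && pwd)" || return 1
    name="$(basename "$input")"
    # Same image, mount point and flags as the "docker run" example above.
    set -- docker run --rm -v "$dir:/home/docker" leofcardoso/pdf2pdfocr -v -i "./$name"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Running it once with DRY_RUN=1 shows the exact command that would be executed, which makes it easy to check the mount path before any container starts.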

basic usage

This will create a searchable (OCR) PDF file in the same directory as "input_file".

pdf2pdfocr.py -i <input_file>

In some cases, you will want to deal with option flags. Please use:

pdf2pdfocr.py --help

to view all the options.
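
To apply the basic "-i" invocation to several files, a small loop is enough. The sketch below is not part of pdf2pdfocr: the function name "ocr_all_pdfs" and the "OCR_CMD" override variable are hypothetical, and the loop uses only the "-i" flag shown above.

```shell
#!/bin/sh
# Hypothetical helper, not part of pdf2pdfocr: run the basic "-i" invocation
# on every *.pdf file directly inside one directory (default: current one).
# OCR_CMD is an assumed override hook; it defaults to pdf2pdfocr.py.
ocr_all_pdfs() {
    dir="${1:-.}"
    failed=0
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue    # the glob matched nothing
        if ! ${OCR_CMD:-pdf2pdfocr.py} -i "$f"; then
            printf 'failed: %s\n' "$f" >&2
            failed=1
        fi
    done
    # Non-zero if at least one file could not be processed.
    return "$failed"
}
```

Because the output is created in the same directory as the input, a second run over that directory will likely pick up the previously generated PDFs as well, so it may be worth moving the results elsewhere between runs.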

It's also possible to use the GUI.

pdf2pdfocr_gui.py <<optional input file>>

fun

Homemade with pride! ;-)

