cpp11tesseract

cpp11tesseract is a fork of the tesseract R package that uses cpp11, for those who require it for licensing or security purposes. It provides bindings to Tesseract-OCR, a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable, so you can tune the detection algorithms to obtain the best possible results.

Upstream Tesseract-OCR documentation: https://tesseract-ocr.github.io/tessdoc/
Introduction: https://docs.ropensci.org/tesseract/articles/intro.html
Reference: https://docs.ropensci.org/tesseract/reference/ocr.html

Simple example

How to extract text from an image:

library(cpp11tesseract)
text <- ocr("inst/examples/wilde.png")
cat(text)
#> Complete Works
#> oF
#> OSCAR WILDE
#> EDITED BY
#> 
#> ROBERT ROSS
#> MISCELLANIES
#> ‘AUTHORIZED EDITION
#> 
#> THE WYMAN-FOGG COMPANY
#> 
#> BOSTON :: MASSACHUSETTS

Differences with the original tesseract R package

This package started as a series of modifications to the original tesseract package to improve performance and add new features. Some of those changes were contributed back to the original, including the functions to choose between the “best” and “fast” models.

However, some changes were not integrated, such as the switch to the cpp11 package, which I need in order to comply with the Munk School IT standards. Using cpp11 allows me to vendor the C++ headers into the package and then conduct an offline installation on the Niagara Cluster.

The documentation also changed a bit: I tried to expand it and to compare the results with Amazon Textract output.

This package includes some changes requested by CRAN, mostly about the package internals. For example, this version lists the dependencies to install on Linux and Mac, which you can install using apt/yum/brew, while the original package uses autobrew to install the Mac dependencies as binaries.

Installation

Installation from source on Linux or macOS requires the Tesseract library (see below).

Install from source

On Debian or Ubuntu, install libtesseract-dev and libleptonica-dev, plus tesseract-ocr-eng to run the examples.

sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng

On Ubuntu you can optionally use this PPA to get the latest version of Tesseract:

sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get install -y libtesseract-dev tesseract-ocr-eng

On Fedora you need tesseract-devel and leptonica-devel:

sudo yum install tesseract-devel leptonica-devel

On RHEL and CentOS you need tesseract-devel and leptonica-devel from EPEL:

sudo yum install epel-release
sudo yum install tesseract-devel leptonica-devel

On macOS, use Tesseract from Homebrew:

brew install tesseract
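After installing the system libraries by any of the routes above, it can help to confirm that the build tools can actually find them before compiling the package. The following is a minimal sketch, assuming pkg-config is available and that the libraries register themselves under the module names tesseract and lept; the helper check_lib is a hypothetical name and not part of this package.

```shell
#!/bin/sh
# Report whether pkg-config can locate a library and, if so, which version.
# check_lib is a hypothetical helper written for this example.
check_lib() {
  if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists "$1"; then
    echo "$1: $(pkg-config --modversion "$1")"
  else
    echo "$1: not found"
  fi
}

# Assumption: "tesseract" and "lept" are the pkg-config module names
# installed by the Tesseract and Leptonica development packages.
check_lib tesseract
check_lib lept
```

If either line reports "not found", the development package is probably missing or installed in a location that pkg-config does not search.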

Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR results for other languages you can install the appropriate training data. On Windows and macOS you can do this in R using tesseract_download():

tesseract_download('fra')

On Linux you need to install the appropriate training data from your distribution. For example, to install the Spanish training data you need tesseract-ocr-spa (Debian, Ubuntu) or tesseract-langpack-spa (Fedora, EPEL).
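The naming pattern above can be sketched as a small shell helper that turns a language code into the matching package name. This is only an illustration: langpack_name is a hypothetical function, it covers plain three-letter codes such as spa or fra, and it prints the install command instead of running it.

```shell
#!/bin/sh
# Map a distribution family and a language code to a training data package.
# langpack_name is a hypothetical helper written for this example.
langpack_name() {
  family="$1"
  lang="$2"
  case "$family" in
    debian) echo "tesseract-ocr-$lang" ;;        # Debian, Ubuntu
    fedora) echo "tesseract-langpack-$lang" ;;   # Fedora, EPEL
    *) echo "unsupported family: $family" >&2; return 1 ;;
  esac
}

# Print the install command for whichever package manager is present.
if command -v apt-get >/dev/null 2>&1; then
  echo "sudo apt-get install -y $(langpack_name debian spa)"
elif command -v yum >/dev/null 2>&1; then
  echo "sudo yum install $(langpack_name fedora spa)"
else
  echo "No apt-get or yum found; install the training data manually."
fi
```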

Alternatively, you can manually download training data from GitHub and store it in a directory on disk. You can then pass that directory in the datapath parameter or set it as the default via the TESSDATA_PREFIX environment variable. Note that Tesseract 4 and Tesseract 3 use different training data formats, so make sure to download training data from the branch that matches your libtesseract version.
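A manual download could look like the sketch below. It rests on several assumptions that you should adapt: the tesseract-ocr/tessdata_fast repository and its main branch (suitable for recent Tesseract versions, not for Tesseract 3), a hypothetical target directory under your home directory, and a hypothetical helper named tessdata_url. The script only prints the commands so that nothing is downloaded by accident.

```shell
#!/bin/sh
# Build the download URL for one language's training data.
# tessdata_url is a hypothetical helper. The repository and branch defaults
# are assumptions: change them to match your libtesseract version.
tessdata_url() {
  lang="$1"
  repo="${2:-tessdata_fast}"
  branch="${3:-main}"
  echo "https://github.com/tesseract-ocr/$repo/raw/$branch/$lang.traineddata"
}

# Hypothetical target directory; any writable directory works.
dest="${TESSDATA_PREFIX:-$HOME/tessdata}"

# Print the commands instead of running them.
echo "mkdir -p $dest"
echo "curl -L -o $dest/spa.traineddata $(tessdata_url spa)"
echo "export TESSDATA_PREFIX=$dest"
```

After running the printed commands, the same directory can be passed through the datapath parameter instead of exporting TESSDATA_PREFIX.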

Testing with docker (development)

mkdir check
# Run all three steps inside the container; chain them with && so that a failure stops the run
docker run -v `pwd`/check:/check ghcr.io/r-hub/containers/clang19:latest bash -c "apt install -y apt-utils libcurl4-openssl-dev && \
  R -q -e \"install.packages(c('Rcpp', 'jsonlite', 'curl', 'httr', 'yaml', 'rex', 'digest', 'crayon', 'withr', 'cli', 'magick', 'processx', 'tibble', 'V8', 'testthat', 'mockery', 'whoami', 'covr', 'asciicast'), repos = 'https://cloud.r-project.org')\" && \
  r-check"
