applicaai/digital-born-pdf-scanner

Digital-born PDF Scanner

Genesis

Many of the PDF files we have downloaded are digital-born, that is, they contain an easily accessible text layer that PDF viewers use to display text. Some are plainly scanned documents with no text layer at all, and some are searchable, OCR-processed scans that contain a lot of hidden text.

Since we want to tell these categories apart, we need a tool that helps detect them. Hence this tool.

Usage

To run the tool, use java -jar:

java -jar digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Invoking the jar without parameters produces the following output:

  Options:
    -f, --filename
      Filename to check in single file mode
    -d, --input-dir
      Input directory to look for PDF files
    -o, --output-file-name
      File to write results to. Supported extensions are *.tsv, *.csv
      Default: results.tsv
    -r, --recursive
      Whether to search for PDF files recursively
      Default: false
    --sort
      Whether to sort file name in results.
      Default: false
    -v, --verbose
      Whether to print processed file names.
      Default: false

There are two modes of operation:

  • single file mode (use -f path-to-pdf-file)
  • directory scan mode (use -d path-to-directory-with-pdf-files)

The latter can be combined with a recursive directory scan (use -r), which also searches subdirectories for PDF files.

Results are written to the TAB-separated file results.tsv unless a different file name is given with -o. The separator depends on the file extension: TAB for *.tsv and semicolon for *.csv.
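To post-process the results, the separator can be picked from the file extension in the same way. The following Python sketch is only an illustration: the helper names delimiter_for and read_results are hypothetical, and it assumes the file is UTF-8 encoded, which this README does not state. It returns raw rows and makes no assumption about whether the file starts with a header row.

```python
import csv
from pathlib import Path


def delimiter_for(path):
    """Return the field separator implied by the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".tsv":
        return "\t"
    if suffix == ".csv":
        # The README says *.csv output is semicolon-separated.
        return ";"
    raise ValueError(f"unsupported results extension: {suffix!r}")


def read_results(path):
    """Read a results file into a list of rows (each row is a list of strings).

    Assumption: the file is UTF-8 encoded.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter=delimiter_for(path)))
```

For example, read_results("results.tsv") would return one list per line of the default output file.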

Handling error log

Since failed files can produce a lot of error output, it makes sense to redirect the error stream to a file, for example:

java -jar digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d dir-with-pdfs -r --sort -v 2> error.log
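The same invocation can be scripted. The Python sketch below is a minimal illustration, not part of the tool: build_command and run_scan are hypothetical helper names, and it assumes java is on the PATH. It uses only the options listed in the usage output above and sends the error stream to a log file, like the 2> redirection does.

```python
import subprocess

JAR = "digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar"


def build_command(input_dir, recursive=False, sort=False, verbose=False, jar=JAR):
    """Assemble the argument list for a directory scan."""
    command = ["java", "-jar", jar, "-d", input_dir]
    if recursive:
        command.append("-r")
    if sort:
        command.append("--sort")
    if verbose:
        command.append("-v")
    return command


def run_scan(input_dir, error_log="error.log", **options):
    """Run the scanner, writing its error stream to error_log.

    Assumption: the java executable is available on the PATH.
    Returns the process exit code.
    """
    with open(error_log, "w", encoding="utf-8") as log:
        completed = subprocess.run(build_command(input_dir, **options), stderr=log)
    return completed.returncode
```

Calling run_scan("dir-with-pdfs", recursive=True, sort=True, verbose=True) corresponds to the shell command shown above.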

Tracking progress

Currently, the only way to track processing progress is to enable verbose output (-v) and combine it with file sorting (--sort). This produces color output showing whether each file was processed successfully or failed. See the example in the previous section.

Output interpretation

The output file consists of the following columns:

Column Name             Description
File Name               Path to the PDF file
Has Hidden Text         Whether hidden text is present in the document
Visible Text Len        Length of the visible text in the document
Hidden Text Len         Length of the hidden text in the document
Creator                 Name of the software that created the document (if any)
Producer                Name of the software library used to produce the document (if any)
Page Count              Number of pages in the document
Max Covered Area Ratio  Maximum ratio of the largest image area to page area (often greater than 1...)
Avg Covered Area Ratio  Average ratio of the largest image area to page area (often greater than 1...)
Image Count             Number of images in the document
Object Count            Number of objects in the document
PDF Version             Version of the PDF standard
Has Outlines            Whether the document contains outlines
Is Tagged               Whether the document has a tag structure
Lang                    Content language
Conformance Level       Document conformance level
Has Page Labels         Whether the document has page labels

At the time of writing, the tool does not tell you whether a document is scanned, searchable scanned, or digital-born. However, certain heuristics can be derived from the output:

  • No text in the document (both Visible Text Len and Hidden Text Len equal 0) - this is a scanned document with 99% probability
  • No visible text, a lot of hidden text, Max Covered Area Ratio ≈ 1.0 - this must be a searchable scanned document
  • No hidden text, a lot of visible text - this is probably a digital-born document

The open question is what counts as "a lot of text". That threshold has to be determined empirically.
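The three heuristics above can be written down as a small decision function. This Python sketch is an illustration under stated assumptions, not something the tool does: the function name classify, the lot_of_text threshold of 1000 characters, and the ratio_tolerance of 0.1 are all placeholder choices that would need tuning on real data.

```python
def classify(visible_len, hidden_len, max_covered_ratio,
             lot_of_text=1000, ratio_tolerance=0.1):
    """Apply the README heuristics to one row of the results file.

    lot_of_text and ratio_tolerance are assumed placeholder values;
    the README leaves the actual thresholds open.
    """
    # No text at all: very likely a plain scan.
    if visible_len == 0 and hidden_len == 0:
        return "scanned"
    # Only hidden text, and an image covering roughly the whole page.
    if (visible_len == 0 and hidden_len >= lot_of_text
            and abs(max_covered_ratio - 1.0) <= ratio_tolerance):
        return "searchable-scanned"
    # Only visible text: probably digital-born.
    if hidden_len == 0 and visible_len >= lot_of_text:
        return "digital-born"
    # Anything else is not covered by the heuristics.
    return "undetermined"
```

Rows that fall outside the three cases, such as documents with a mix of visible and hidden text, are reported as "undetermined" rather than forced into a category.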
