Many of PDF files that we have downloaded are digital-born, that is contain easily accessible text layer that PDF viewers use to display text. Some are definitely scanned documents, that do not have any text layer at all, some are searchable OCR-processed scans that contain a lot of hidden text.
Since we want to tell apart all of these categories, we need a tool to detect them. Thus this tool.
In order to run use java -jar
:
java -jar digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Invoking the jar without parameters produces the following output:
Options:
-f, --filename
Filename to check in single file mode
-d, --input-dir
Input directory to look for PDF files
-o, --output-file-name
File to write results to. Supported extensions are *.tsv, *.csv
Default: results.tsv
-r, --recursive
Whether to search for PDF files recursively
Default: false
--sort
Whether to sort file name in results.
Default: false
-v, --verbose
Whether to print processed file names.
Default: false
Clearly, there are two modes of operation:
- single file mode (use
-f path-to-pdf-file
) - directory scan mode (use
-d path-to-directory-with-pdf-files
)
The latter can be used with recursive directory scan (use -r
), which searches subdirectories for PDF files.
The output will be stored to TAB-separated file results.tsv
unless different file name is provided. The output will be
either semicolon or TAB-separated depending on file extension.
Since there might be a lot of errors coming from failed files printed, it makes sense to redirect logs to a file, for example:
java -jar digital-born-pdf-scanner-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d dir-with-pdfs -r --sort -v 2> error.log
Currently the only way to track processing progress is to enable verbose output (-v
) and couple it with file sorting
(--sort
). This will produce nice color output showing which file is successfully processed (or whether there was
processing failure). Please see example above.
The output file consists of following columns:
Column Name | Description |
---|---|
File Name | Path to PDF file |
Has Hidden Text | Is hidden text present in a document |
Visible Text Len | Length of visible text in a document |
Hidden Text Len | Length of hidden text in a document |
Creator | Name of a software that created a document (if any) |
Producer | Name of a software library used to produce a document (if any) |
Page Count | Number of pages in a document |
Max Covered Area Ratio | Maximal ratio of the largest image area to page area (often greater than 1...) |
Avg Covered Area Ratio | Average ratio of the largest image area to page area (often greater than 1...) |
Image Count | Number of images in a document |
Object Count | Number of objects in a document |
PDF Version | Version of PDF standard |
Has Outlines | Whether document contains outlines |
Is Tagged | Does document have tag structure |
Lang | Content language |
Conformance Level | Document conformance level |
Has Page Labels | Whether there are page labels in a document |
At a time of writing, the tool does not tell you if document is scanned, searchable scanned, or is digital born. However, certain heuristics can be deduced from the output:
- No text in a document (both visible and hidden text len equals 0) - this is a scanned document with 99% probability
- No visible text, a lot of hidden text, max covered area ≈ 1.0 - this must be a searchable scanned document
- No hidden text, a lot of visible text - this is probably digital-born document
The question is what is "a lot of text". Well, we have to check to know.