The Evaluation

TODO: The content of output/ is still missing.

The following 13 PDF extraction tools were evaluated on their semantic ability to extract the body text from PDF files of scientific articles:

pdftotext, pdftohtml, pdf2xml (Xerox), pdf2xml (Tiedemann), PdfBox, ParsCit, LA-PdfText, PdfMiner, pdfXtk, pdf-extract, PDFExtract, Grobid, Icecite.

Each tool was used to extract text from the PDF files of the benchmark. For each tool, reasonable input parameters were selected in order to get output files that reflect, as far as possible, the structure of the ground truth files.

For tools with XML output, the output was translated to plain text by identifying the relevant text parts. If semantic roles were provided, only those logical text blocks that are also present in the ground truth files were selected. If the text was broken down into blocks of any kind (such as paragraphs, columns, or sections), the blocks were separated by blank lines, as in the ground truth files.
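The translation step described above can be sketched as follows. This is a minimal illustration, not the code used in this project: the XML layout (a `block` element with an optional `role` attribute), the set of kept roles, and the function name `xml_to_plain_text` are all assumptions made for the example and do not reflect the output format of any particular tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of semantic roles that also occur in the ground truth files.
KEPT_ROLES = {"heading", "paragraph"}


def xml_to_plain_text(xml_string, kept_roles=KEPT_ROLES):
    """Translate a (hypothetical) XML tool output to plain text.

    Blocks with a role are kept only if the role is in kept_roles.
    Blocks without a role are always kept (an assumption of this sketch).
    The resulting blocks are separated by blank lines.
    """
    root = ET.fromstring(xml_string)
    blocks = []
    for block in root.iter("block"):
        role = block.get("role")
        if role is not None and role not in kept_roles:
            continue
        # Join all text inside the block and normalize whitespace.
        text = " ".join("".join(block.itertext()).split())
        if text:
            blocks.append(text)
    return "\n\n".join(blocks) + "\n"


# Made-up sample input in the assumed layout.
sample = """
<document>
  <block role="heading">1 Introduction</block>
  <block role="paragraph">PDF is a   layout-based format.</block>
  <block role="page-header">Journal of Examples</block>
  <block>Text without a role.</block>
</document>
""".strip()

print(xml_to_plain_text(sample))
```

In this sketch the page header is dropped because its role is not in the kept set, while the block without a role is passed through unchanged.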

Basic Structure

There are three folders:

bin contains all files needed to manage the extraction processes and to measure the evaluation criteria (see below).
output contains the output files of the extraction tools for each PDF file of the benchmark.
  Files ending with .raw.txt contain the original output of a tool (usually in XML or plain text format).
  Files ending with .final.txt are the plain text files translated from the original output.
tools contains one subfolder for each evaluated tool. Each subfolder contains:
  a folder bin that contains the binaries of the tool used in the extraction process.
  a file tool_extractor.py that contains the code to translate a .raw.txt file into a .final.txt file.
  a file config.ini that defines some tool-specific metadata, in particular the command to use in the extraction process.
  a file notices.txt that contains notes on the steps performed and on issues that occurred while installing the tool.

For illustration, consider the folder tools/pdftotext. It contains the folder tools/pdftotext/bin with the executable pdftotext that is used in the extraction process. The file tools/pdftotext/config.ini with the content

[general]
name = pdftotext
url = https://poppler.freedesktop.org/
info = Converts PDF files to plain text files.

[run]
cmd = ./bin/pdftotext -nopgbrk $IN $OUT

[extractor]
name = tool_extractor.PdfToTextExtractor

defines the name, a project URL, a short description, the command to use for extraction (where $IN is a placeholder for the path to a PDF file and $OUT is a placeholder for the path to the output file), and the name of the extractor class in tool_extractor.py.
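As a rough sketch of how such a config.ini could be consumed, the following example reads the file content with Python's standard `configparser` and fills in the $IN and $OUT placeholders with `string.Template`. How bin/extractor.py actually processes the file is not shown here, so the function names `build_command` and `split_class_path` and the overall approach are assumptions made for illustration.

```python
import configparser
import shlex
from string import Template

# The config.ini content shown above, embedded for a self-contained example.
CONFIG_TEXT = """
[general]
name = pdftotext
url = https://poppler.freedesktop.org/
info = Converts PDF files to plain text files.

[run]
cmd = ./bin/pdftotext -nopgbrk $IN $OUT

[extractor]
name = tool_extractor.PdfToTextExtractor
"""


def load_config(config_text):
    # Disable interpolation so that '$' and '%' are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return parser


def build_command(config_text, pdf_path, out_path):
    """Return the extraction command as an argument list (hypothetical helper)."""
    cmd_template = Template(load_config(config_text)["run"]["cmd"])
    # Quote the paths so that spaces survive the split into arguments.
    cmd = cmd_template.substitute(IN=shlex.quote(pdf_path), OUT=shlex.quote(out_path))
    return shlex.split(cmd)


def split_class_path(config_text):
    """Split 'module.ClassName' into its two parts (hypothetical helper)."""
    module_name, class_name = load_config(config_text)["extractor"]["name"].rsplit(".", 1)
    return module_name, class_name


print(build_command(CONFIG_TEXT, "paper one.pdf", "out.raw.txt"))
print(split_class_path(CONFIG_TEXT))
```

The argument list returned by `build_command` could then be passed to `subprocess.run` with the tool's folder as the working directory, since the command in the config file is given relative to that folder.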

Evaluation Criteria

When a tool is evaluated, each of its .final.txt output files is compared with the corresponding ground truth file. The following evaluation criteria are measured:

NL+: the number of spurious newlines in the output file.
NL-: the number of missing newlines in the output file.
P+: the number of spurious paragraphs in the output file.
P-: the number of missing paragraphs in the output file.
P↕: the number of rearranged paragraphs in the output file.
W+: the number of spurious words in the output file.
W-: the number of missing words in the output file.
W~: the number of misspelled words in the output file.
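To give an intuition for the word-level criteria, the following sketch counts spurious, missing, and replaced words with Python's standard `difflib`. It is only an illustration: the algorithm used by bin/evaluator.py is not described in this document, the function name `count_word_differences` is hypothetical, and treating every replaced word as a misspelling is a simplifying assumption. Newline and paragraph criteria are not covered.

```python
from difflib import SequenceMatcher


def count_word_differences(ground_truth, output):
    """Count spurious (W+), missing (W-) and replaced (W~) words.

    Illustrative only: a word-level diff between the two texts, where a
    replaced word is counted as misspelled.
    """
    gt_words = ground_truth.split()
    out_words = output.split()
    counts = {"W+": 0, "W-": 0, "W~": 0}
    matcher = SequenceMatcher(a=gt_words, b=out_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            # Words only present in the output file.
            counts["W+"] += j2 - j1
        elif tag == "delete":
            # Words only present in the ground truth file.
            counts["W-"] += i2 - i1
        elif tag == "replace":
            # Pair up replaced words as misspellings; count any surplus
            # on either side as spurious or missing.
            paired = min(i2 - i1, j2 - j1)
            counts["W~"] += paired
            counts["W-"] += (i2 - i1) - paired
            counts["W+"] += (j2 - j1) - paired
    return counts


print(count_word_differences("the quick brown fox", "the quikc brown fox again"))
```

For the sample call, the diff pairs "quick" with "quikc" as one replaced word and reports "again" as one spurious word.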

Evaluation Results

The following table summarizes the evaluation results of each evaluated tool, broken down by the introduced criteria.

Each criterion is given by two numbers: (1) an absolute number and (2) a percentage. For NL+ and NL-, the percentage gives the absolute number relative to the number of newlines in the ground truth files. For all other criteria, it gives the number of affected words relative to the number of words in the ground truth files.
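The percentage rule can be written down as a small function. This is a sketch of the rule as stated above, not code from this project; the function name `relative_share`, its parameters, and the numbers in the example calls are made up for illustration.

```python
def relative_share(criterion, absolute, affected_words, gt_newlines, gt_words):
    """Return the percentage that accompanies an absolute number.

    criterion:      name of the criterion, e.g. "NL+", "P-", "W~".
    absolute:       absolute number measured for the criterion.
    affected_words: number of words affected by the measured operations.
    gt_newlines:    number of newlines in the ground truth files.
    gt_words:       number of words in the ground truth files.
    """
    if criterion in ("NL+", "NL-", "NL–"):
        # Newline criteria: absolute number relative to ground truth newlines.
        return 100.0 * absolute / gt_newlines
    # All other criteria: affected words relative to ground truth words.
    return 100.0 * affected_words / gt_words


# Hypothetical numbers, chosen only to show the two cases.
print(round(relative_share("NL+", 14, 0, 88, 3000), 1))
print(round(relative_share("P+", 3, 120, 88, 3000), 1))
```

Note that for the paragraph criteria the percentage is based on words, not on paragraph counts, which is why a small absolute number of paragraphs can correspond to a large percentage.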

The column ERR gives the number of PDF files which could not be processed by a tool. The column T∅ gives the average time needed to process a single PDF file, in seconds. The best values in each column are printed in bold.

Tool	NL+	NL-	P+	P-	P↕	W+	W-	W~	ERR	T∅
pdftotext	14 (16%)	44 (53%)	60 (29%)	2.3 (0.6%)	1.4 (1.9%)	24 (0.7%)	2.4 (0.1%)	41 (1.2%)	2	0.3
pdftohtml	3.6 (4.3%)	70 (84%)	9.2 (31%)	4.2 (3.2%)	0.1 (0.1%)	16 (0.5%)	1.6 (0.0%)	95 (2.9%)	0	2.2
pdftoxml	33 (40%)	20 (25%)	80 (31%)	1.8 (0.5%)	0.1 (0.1%)	21 (0.6%)	1.5 (0.0%)	154 (4.7%)	1	0.7
PdfBox	3.0 (3.6%)	70 (85%)	7.6 (27%)	0.9 (0.2%)	0.0 (0.1%)	17 (0.5%)	1.5 (0.0%)	53 (1.6%)	2	8.8
pdf2xml	33 (40%)	39 (48%)	44 (21%)	40 (30%)	7.8 (9.5%)	8.6 (0.3%)	3.6 (0.1%)	34 (0.9%)	1444	37
ParsCit	15 (18%)	39 (47%)	10 (10%)	14 (6.4%)	1.3 (1.8%)	16 (0.5%)	2.3 (0.1%)	37 (1.1%)	1	6.8
LA-PdfText	5.5 (6.4%)	23 (28%)	4.8 (3.1%)	52 (73%)	2.9 (5.9%)	5.7 (0.1%)	6.1 (0.1%)	26 (0.6%)	324	24
PdfMiner	32 (38%)	18 (21%)	84 (30%)	3.6 (1.0%)	1.4 (2.1%)	34 (1.0%)	2.6 (0.1%)	110 (3.3%)	23	16
pdfXtk	7.9 (9.7%)	68 (84%)	12 (29%)	4.5 (3.5%)	0.1 (0.1%)	59 (1.8%)	6.1 (0.2%)	95 (3.0%)	739	22
pdf-extract	95 (114%)	53 (64%)	99 (32%)	8.4 (3.1%)	4.1 (7.7%)	74 (2.1%)	41 (1.2%)	149 (4.2%)	72	34
PDFExtract	9.5 (11%)	33 (40%)	28 (21%)	22 (25%)	0.8 (0.9%)	12 (0.4%)	2.8 (0.1%)	61 (1.8%)	176	46
Grobid	9.5 (11%)	30 (36%)	7.5 (6.7%)	11 (15%)	0.0 (0.0%)	14 (0.4%)	1.6 (0.0%)	63 (1.9%)	29	42
Icecite	3.4 (4.0%)	10 (13%)	6.2 (4.2%)	7.7 (5.5%)	0.1 (0.1%)	10 (0.3%)	1.7 (0.1%)	21 (0.6%)	34	41

Usage

To automate the extraction processes, use the file bin/extractor.py. Run bin/extractor.py --help for detailed usage information.

Similarly, to evaluate the output files against the ground truth, use the file bin/evaluator.py. Run bin/evaluator.py --help for detailed usage information.

The Makefile defines the rules extract and evaluate, which call these executables with values adapted to our project. Run them by typing make extract or make evaluate.
