TODO: The content of output/ is still missing.
The following 13 PDF extraction tools were evaluated on their semantic abilities to extract the body texts from PDF files of scientific articles:
pdftotext, pdftohtml, pdf2xml (Xerox), pdf2xml (Tiedemann), PdfBox, ParsCit, LA-PdfText, PdfMiner, pdfXtk, pdf-extract, PDFExtract, Grobid, Icecite.
Each tool was used to extract texts from the PDF files of the benchmark. For each tool, reasonable input parameters were selected in order to get output files that reflect, as far as possible, the structure of ground truth files.
For tools with XML output, the output was translated to plain text by identifying the relevant text parts. If semantic roles were provided, only those logical text blocks that are also present in the ground truth files were selected. If texts were broken down into any kind of blocks (like paragraphs, columns, or sections), the blocks were separated by blank lines (as in the ground truth files).
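The translation step described above can be sketched as follows. This is only an illustration, not the code from any tool_extractor.py: the XML layout, the element name block, the attribute role, and the set RELEVANT_ROLES are all hypothetical, since each tool uses its own output schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw output; real tools use their own element and attribute names.
RAW_XML = """
<document>
  <block role="title">A Sample Title</block>
  <block role="body">First paragraph of the body.</block>
  <block role="page-header">Journal of Examples</block>
  <block role="body">Second paragraph
  of the body.</block>
</document>
"""

# Assumed set of semantic roles that also occur in the ground truth files.
RELEVANT_ROLES = {"title", "body"}


def xml_to_plain_text(raw_xml, relevant_roles=RELEVANT_ROLES):
    """Keep blocks with a relevant role and separate them by blank lines."""
    root = ET.fromstring(raw_xml.strip())
    blocks = []
    for block in root.iter("block"):
        if block.get("role") not in relevant_roles:
            continue
        # Collapse internal whitespace so each block becomes a single line.
        text = " ".join("".join(block.itertext()).split())
        if text:
            blocks.append(text)
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(xml_to_plain_text(RAW_XML))
```

In this sketch the page header is dropped because its role is not in the assumed set, and the remaining three blocks are joined with blank lines.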
There are three folders:
- bin contains all files needed to manage the extraction processes and to measure the evaluation criteria (see below).
- output contains the output files of the extraction tools for each PDF file of the benchmark. Files ending with .raw.txt contain the original outputs of the tools (usually in XML or plain text format). Files ending with .final.txt are the plain text files translated from the original outputs.
- tools contains tool-specific subfolders, one for each evaluated tool. Each contains
  - a folder bin that contains the binaries of the tool used in the extraction process.
  - a file tool_extractor.py that contains the code to translate a .raw.txt file into a .final.txt file.
  - a file config.ini defining some tool-specific metadata, in particular the command to use in the extraction process.
  - a file notices.txt that contains some notes on the steps performed and on issues that occurred while installing the tool.
For illustration, consider the folder tools/pdftotext. It contains the folder tools/pdftotext/bin with the executable pdftotext that is used in the extraction process.
The file tools/pdftotext/config.ini with content
[general]
name = pdftotext
url = https://poppler.freedesktop.org/
info = Converts PDF files to plain text files.
[run]
cmd = ./bin/pdftotext -nopgbrk $IN $OUT
[extractor]
name = tool_extractor.PdfToTextExtractor
defines the name, a project URL, a short description, the command to use for extraction (where $IN is a placeholder for the path to a PDF file and $OUT is a placeholder for the path to the output file), and the name of the extractor class in tool_extractor.py.
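As a rough sketch of how such a config.ini could be consumed, the following reads the file content with Python's configparser, fills in the $IN and $OUT placeholders, and splits the extractor name into a module and a class name. The helper names build_command and extractor_class_path and the example file names are assumptions made for illustration; the actual logic in bin/extractor.py may differ.

```python
import configparser
import shlex
from string import Template

CONFIG_TEXT = """
[general]
name = pdftotext
url = https://poppler.freedesktop.org/
info = Converts PDF files to plain text files.
[run]
cmd = ./bin/pdftotext -nopgbrk $IN $OUT
[extractor]
name = tool_extractor.PdfToTextExtractor
"""


def load_config(config_text):
    # Disable interpolation so that "$" and "%" are read literally.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    return config


def build_command(config_text, pdf_path, out_path):
    """Return the extraction command as an argument list (hypothetical helper)."""
    config = load_config(config_text)
    # Quote the paths so that file names with spaces survive the split below.
    cmd = Template(config["run"]["cmd"]).substitute(
        IN=shlex.quote(pdf_path), OUT=shlex.quote(out_path)
    )
    return shlex.split(cmd)


def extractor_class_path(config_text):
    """Split 'module.ClassName' into its two parts (hypothetical helper)."""
    config = load_config(config_text)
    module_name, class_name = config["extractor"]["name"].rsplit(".", 1)
    return module_name, class_name


if __name__ == "__main__":
    print(build_command(CONFIG_TEXT, "paper.pdf", "paper.raw.txt"))
    print(extractor_class_path(CONFIG_TEXT))
```

The resulting argument list could be passed to subprocess.run with the tool folder as working directory, because the command is given relative to that folder.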
When evaluating a tool, each of its .final.txt output files is compared with the corresponding ground truth file.
The following evaluation criteria are measured:
- NL+: the number of spurious newlines in the output file.
- NL–: the number of missing newlines in the output file.
- P+: the number of spurious paragraphs in the output file.
- P–: the number of missing paragraphs in the output file.
- P↕: the number of rearranged paragraphs in the output file.
- W+: the number of spurious words in the output file.
- W–: the number of missing words in the output file.
- W~: the number of misspelled words in the output file.
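To make the word-level criteria more concrete, here is a minimal sketch that counts spurious, missing, and misspelled words with Python's difflib. It is not the algorithm used by bin/evaluator.py; in particular, treating every replaced word as a misspelling is a simplifying assumption, and the function name count_word_differences is made up for this example.

```python
import difflib


def count_word_differences(ground_truth, output):
    """Count spurious, missing and misspelled words via a word-level diff."""
    gt_words = ground_truth.split()
    out_words = output.split()
    matcher = difflib.SequenceMatcher(a=gt_words, b=out_words, autojunk=False)
    counts = {"spurious": 0, "missing": 0, "misspelled": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            # Words that occur only in the tool output.
            counts["spurious"] += j2 - j1
        elif tag == "delete":
            # Words that occur only in the ground truth.
            counts["missing"] += i2 - i1
        elif tag == "replace":
            # Simplification: pair replaced words as misspellings; any
            # surplus on one side counts as missing or spurious.
            paired = min(i2 - i1, j2 - j1)
            counts["misspelled"] += paired
            counts["missing"] += (i2 - i1) - paired
            counts["spurious"] += (j2 - j1) - paired
    return counts


if __name__ == "__main__":
    print(count_word_differences(
        "the quick brown fox jumps",
        "the quikc brown fox jumps high",
    ))
```

For the two example strings, the sketch reports one misspelled word (quikc) and one spurious word (high).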
The following table summarizes the evaluation results of each evaluated tool, broken down by the introduced criteria.
Each criterion is given by two numbers: (1) an absolute number and (2) a percentage. For NL+ and NL–, the percentage gives the absolute number relative to the number of newlines in the ground truth files. For all other criteria, it gives the number of affected words relative to the number of words in the ground truth files.
The column ERR gives the number of PDF files which could not be processed by a tool. The column T∅ gives the average time needed to process a single PDF file, in seconds. The best values in each column are printed in bold.
Tool | NL+ | NL– | P+ | P– | P↕ | W+ | W– | W~ | ERR | T∅ |
---|---|---|---|---|---|---|---|---|---|---|
pdftotext | 14 (16%) | 44 (53%) | 60 (29%) | 2.3 (0.6%) | 1.4 (1.9%) | 24 (0.7%) | 2.4 (0.1%) | 41 (1.2%) | 2 | 0.3 |
pdftohtml | 3.6 (4.3%) | 70 (84%) | 9.2 (31%) | 4.2 (3.2%) | 0.1 (0.1%) | 16 (0.5%) | 1.6 (0.0%) | 95 (2.9%) | 0 | 2.2 |
pdftoxml | 33 (40%) | 20 (25%) | 80 (31%) | 1.8 (0.5%) | 0.1 (0.1%) | 21 (0.6%) | 1.5 (0.0%) | 154 (4.7%) | 1 | 0.7 |
PdfBox | 3.0 (3.6%) | 70 (85%) | 7.6 (27%) | 0.9 (0.2%) | 0.0 (0.1%) | 17 (0.5%) | 1.5 (0.0%) | 53 (1.6%) | 2 | 8.8 |
pdf2xml | 33 (40%) | 39 (48%) | 44 (21%) | 40 (30%) | 7.8 (9.5%) | 8.6 (0.3%) | 3.6 (0.1%) | 34 (0.9%) | 1444 | 37 |
ParsCit | 15 (18%) | 39 (47%) | 10 (10%) | 14 (6.4%) | 1.3 (1.8%) | 16 (0.5%) | 2.3 (0.1%) | 37 (1.1%) | 1 | 6.8 |
LA-PdfText | 5.5 (6.4%) | 23 (28%) | 4.8 (3.1%) | 52 (73%) | 2.9 (5.9%) | 5.7 (0.1%) | 6.1 (0.1%) | 26 (0.6%) | 324 | 24 |
PdfMiner | 32 (38%) | 18 (21%) | 84 (30%) | 3.6 (1.0%) | 1.4 (2.1%) | 34 (1.0%) | 2.6 (0.1%) | 110 (3.3%) | 23 | 16 |
pdfXtk | 7.9 (9.7%) | 68 (84%) | 12 (29%) | 4.5 (3.5%) | 0.1 (0.1%) | 59 (1.8%) | 6.1 (0.2%) | 95 (3.0%) | 739 | 22 |
pdf-extract | 95 (114%) | 53 (64%) | 99 (32%) | 8.4 (3.1%) | 4.1 (7.7%) | 74 (2.1%) | 41 (1.2%) | 149 (4.2%) | 72 | 34 |
PDFExtract | 9.5 (11%) | 33 (40%) | 28 (21%) | 22 (25%) | 0.8 (0.9%) | 12 (0.4%) | 2.8 (0.1%) | 61 (1.8%) | 176 | 46 |
Grobid | 9.5 (11%) | 30 (36%) | 7.5 (6.7%) | 11 (15%) | 0.0 (0.0%) | 14 (0.4%) | 1.6 (0.0%) | 63 (1.9%) | 29 | 42 |
Icecite | 3.4 (4.0%) | 10 (13%) | 6.2 (4.2%) | 7.7 (5.5%) | 0.1 (0.1%) | 10 (0.3%) | 1.7 (0.1%) | 21 (0.6%) | 34 | 41 |
To automate the extraction processes, you can use the file bin/extractor.py. Type bin/extractor.py --help for detailed usage information.
Similarly, to evaluate the output files against the ground truth, use the file bin/evaluator.py. Type bin/evaluator.py --help for detailed usage information.
The Makefile defines the rules extract and evaluate, which call the executables with values adapted to our project. Run them by typing make extract or make evaluate.