The Evaluation

TODO: The content of output/ is still missing.

The following 13 PDF extraction tools were evaluated with respect to their ability to semantically extract the body texts from PDF files of scientific articles:

pdftotext, pdftohtml, pdf2xml (Xerox), pdf2xml (Tiedemann), PdfBox, ParsCit, LA-PdfText, PdfMiner, pdfXtk, pdf-extract, PDFExtract, Grobid, Icecite.

Each tool was used to extract text from the PDF files of the benchmark. For each tool, reasonable input parameters were chosen so that the output files reflect, as far as possible, the structure of the ground truth files.

For tools with XML output, the output was translated to plain text by identifying the relevant text parts. If semantic roles were provided, only those logical text blocks that also occur in the ground truth files were selected. If texts were broken down into blocks of any kind (such as paragraphs, columns, or sections), the blocks were separated by blank lines, as in the ground truth files.
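For illustration, the following sketch shows how such a translation could look for a tool with XML output. The element name block, the attribute type, and the set of relevant roles are hypothetical; every tool uses its own schema, so the actual extractors in this repository have to adapt the idea to each output format.

import xml.etree.ElementTree as ET

# Logical roles that are also present in the ground truth files (hypothetical).
RELEVANT_ROLES = {"title", "heading", "paragraph"}

def xml_to_plain_text(raw_path):
    """Translate an XML extraction result into plain text."""
    tree = ET.parse(raw_path)
    blocks = []
    for block in tree.iter("block"):
        role = block.get("type")
        # If a semantic role is given, keep only blocks whose role also
        # occurs in the ground truth files.
        if role is not None and role not in RELEVANT_ROLES:
            continue
        text = " ".join(" ".join(block.itertext()).split())
        if text:
            blocks.append(text)
    # Separate the logical text blocks by blank lines, as in the ground truth.
    return "\n\n".join(blocks) + "\n"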

Basic Structure

There are three folders:

  • bin contains all files needed to manage the extraction processes and to measure the evaluation criteria (see below).
  • output contains the output files of the extraction tools for each PDF file of the benchmark.
    Files ending in .raw.txt contain the original output of a tool (usually in XML or plain text format).
    Files ending in .final.txt contain the plain text translated from the original output.
  • tools contains tool-specific subfolders, one for each evaluated tool. Each subfolder contains
    ◦ a folder bin that contains the binaries of the tool used in the extraction process;
    ◦ a file tool_extractor.py that contains the code to translate a .raw.txt file into a .final.txt file (a sketch follows this list);
    ◦ a file config.ini defining some tool-specific metadata, in particular the command to use for the extraction process;
    ◦ a file notices.txt that contains some hints about the performed steps and about issues that occurred while installing the tool.
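To give an idea of what such a tool_extractor.py might look like, here is a minimal sketch for pdftotext. The method names are hypothetical; the actual interface expected by bin/extractor.py may differ.

class PdfToTextExtractor:
    """Translates the raw pdftotext output into the final plain text format."""

    def create_final_output(self, raw_path, final_path):
        # Read the original tool output (.raw.txt) ...
        with open(raw_path, encoding="utf-8") as f:
            raw = f.read()
        # ... translate it ...
        final = self.format_output(raw)
        # ... and write the plain text result (.final.txt).
        with open(final_path, "w", encoding="utf-8") as f:
            f.write(final)

    def format_output(self, raw):
        # pdftotext already emits plain text; tools with XML output would
        # parse the XML here and keep only the relevant logical text blocks.
        return raw.strip() + "\n"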

For illustration, consider the folder tools/pdftotext. It contains the folder tools/pdftotext/bin with the executable pdftotext that is used in the extraction process. The file tools/pdftotext/config.ini with the content

[general]
name = pdftotext
url = https://poppler.freedesktop.org/
info = Converts PDF files to plain text files.

[run]
cmd = ./bin/pdftotext -nopgbrk $IN $OUT

[extractor]
name = tool_extractor.PdfToTextExtractor

defines the tool's name, a project URL, a short description, the command to use for extraction (where $IN is a placeholder for the path to a PDF file and $OUT is a placeholder for the path to the output file), and the name of the extractor class in tool_extractor.py.
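The following sketch illustrates how such a config could be used to run a tool on a single PDF file. It is not the actual implementation of bin/extractor.py, and the function name is made up, but it shows how the $IN and $OUT placeholders are meant to be filled in.

import configparser
import shlex
import subprocess

def run_tool(tool_dir, pdf_path, raw_output_path):
    """Run the extraction command of a tool on a single PDF file."""
    config = configparser.ConfigParser()
    config.read(f"{tool_dir}/config.ini")
    # Read the command template and substitute the placeholders.
    cmd = config["run"]["cmd"]
    cmd = cmd.replace("$IN", pdf_path).replace("$OUT", raw_output_path)
    # The command is relative to the tool folder (e.g. ./bin/pdftotext).
    subprocess.run(shlex.split(cmd), cwd=tool_dir, check=True)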

Evaluation Criteria

To evaluate a tool, each of its .final.txt output files is compared with the corresponding ground truth file. The following evaluation criteria are measured (an illustrative sketch of the word-level measurement follows the list):

  • NL+: the number of spurious newlines in the output file.
  • NL-: the number of missing newlines in the output file.
  • P+: the number of spurious paragraphs in the output file.
  • P-: the number of missing paragraphs in the output file.
  • P↕: the number of rearranged paragraphs in the output file.
  • W+: the number of spurious words in the output file.
  • W-: the number of missing words in the output file.
  • W~: the number of misspelled words in the output file.
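As an illustration of the word-level criteria, the following sketch counts spurious and missing words by aligning the word sequences of an output file and the corresponding ground truth file. It only conveys the idea; the actual measurement is implemented in bin/evaluator.py and also covers the newline- and paragraph-level criteria.

import difflib

def count_word_diffs(output_text, groundtruth_text):
    """Count spurious (W+) and missing (W-) words in an output file."""
    out_words = output_text.split()
    gt_words = groundtruth_text.split()
    spurious = missing = 0
    matcher = difflib.SequenceMatcher(a=gt_words, b=out_words, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "insert":
            # Words that occur only in the output file are spurious.
            spurious += j2 - j1
        elif op == "delete":
            # Words that occur only in the ground truth are missing.
            missing += i2 - i1
        elif op == "replace":
            # Differing spans could also be treated as misspelled words (W~);
            # here they are simply counted on both sides.
            spurious += j2 - j1
            missing += i2 - i1
    return spurious, missing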

Evaluation Results

The following table summarizes the evaluation results of each evaluated tool, broken down by the criteria introduced above.

Each criterion is given by two numbers: (1) an absolute number and (2) a percentage. For NL+ and NL-, the percentage relates the absolute number to the number of newlines in the ground truth files; in all other cases, it relates the number of affected words to the number of words in the ground truth files. For example, an NL+ value of 14 (16%) means 14 spurious newlines, corresponding to 16% of the newlines in the ground truth files.

The column ERR gives the number of PDF files that could not be processed by a tool. The column T∅ gives the average time needed to process a single PDF file, in seconds. The best values in each column are printed in bold.

Tool         NL+         NL-       P+          P-          P↕          W+          W-          W~          ERR   T∅
pdftotext    14 (16%)    44 (53%)  60 (29%)    2.3 (0.6%)  1.4 (1.9%)  24 (0.7%)   2.4 (0.1%)  41 (1.2%)   2     0.3
pdftohtml    3.6 (4.3%)  70 (84%)  9.2 (31%)   4.2 (3.2%)  0.1 (0.1%)  16 (0.5%)   1.6 (0.0%)  95 (2.9%)   0     2.2
pdftoxml     33 (40%)    20 (25%)  80 (31%)    1.8 (0.5%)  0.1 (0.1%)  21 (0.6%)   1.5 (0.0%)  154 (4.7%)  1     0.7
PdfBox       3.0 (3.6%)  70 (85%)  7.6 (27%)   0.9 (0.2%)  0.0 (0.1%)  17 (0.5%)   1.5 (0.0%)  53 (1.6%)   2     8.8
pdf2xml      33 (40%)    39 (48%)  44 (21%)    40 (30%)    7.8 (9.5%)  8.6 (0.3%)  3.6 (0.1%)  34 (0.9%)   1444  37
ParsCit      15 (18%)    39 (47%)  10 (10%)    14 (6.4%)   1.3 (1.8%)  16 (0.5%)   2.3 (0.1%)  37 (1.1%)   1     6.8
LA-PdfText   5.5 (6.4%)  23 (28%)  4.8 (3.1%)  52 (73%)    2.9 (5.9%)  5.7 (0.1%)  6.1 (0.1%)  26 (0.6%)   324   24
PdfMiner     32 (38%)    18 (21%)  84 (30%)    3.6 (1.0%)  1.4 (2.1%)  34 (1.0%)   2.6 (0.1%)  110 (3.3%)  23    16
pdfXtk       7.9 (9.7%)  68 (84%)  12 (29%)    4.5 (3.5%)  0.1 (0.1%)  59 (1.8%)   6.1 (0.2%)  95 (3.0%)   739   22
pdf-extract  95 (114%)   53 (64%)  99 (32%)    8.4 (3.1%)  4.1 (7.7%)  74 (2.1%)   41 (1.2%)   149 (4.2%)  72    34
PDFExtract   9.5 (11%)   33 (40%)  28 (21%)    22 (25%)    0.8 (0.9%)  12 (0.4%)   2.8 (0.1%)  61 (1.8%)   176   46
Grobid       9.5 (11%)   30 (36%)  7.5 (6.7%)  11 (15%)    0.0 (0.0%)  14 (0.4%)   1.6 (0.0%)  63 (1.9%)   29    42
Icecite      3.4 (4.0%)  10 (13%)  6.2 (4.2%)  7.7 (5.5%)  0.1 (0.1%)  10 (0.3%)   1.7 (0.1%)  21 (0.6%)   34    41

Usage

To automate the extraction processes, you can use the script bin/extractor.py. Type bin/extractor.py --help to get detailed usage information.

Similarly, to evaluate the output files against the ground truth, use the script bin/evaluator.py. Type bin/evaluator.py --help to get detailed usage information.

The Makefile defines the rules extract and evaluate, which call these scripts with values adapted to our project. Invoke them by typing make extract or make evaluate.