We have built new annotations for the Contextualized Table Extraction (CTE) task by fusing two well-known datasets:
- PubLayNet[^1], a dataset for Document Layout Analysis with 5 different labeled region types;
- PubTables-1M[^2], a dataset for Table Detection, Table Structure Recognition and Functional Analysis.
Tables are important sources of information for research purposes, and giving them a context (instead of focusing on them in isolation) can help in their extraction. We were inspired mainly by two works:
- DocBank[^3], to reformulate the problem as a token-classification task;
- AxCell[^4], to give tables a context, also for comparing research results.
You can read more details in our paper: CTE: Contextualized Table Extraction Dataset (under review).
About the PDF data: we do not own the copyright of the original data, and we cannot redistribute them. The PDF files can be downloaded from here.
Run in your environment:

```shell
pip install -e .
```

to install the dependencies.
After that, download:
- the PubLayNet annotations from here
- the PubTables-1M-PDF_Annotations_JSON archive from here

and place them as described in the Project Tree section.
Finally, to generate the annotations, run:

```shell
python src/generate_annotations.py
```
You will find the train, val and test annotation JSON files in the data/merged subfolder.
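Once generated, each split can be loaded with Python's standard `json` module. A minimal sketch follows; the fallback sample (used when the file has not been generated yet) mirrors the documented annotation format, and its page file name and values are purely illustrative:

```python
import json
from pathlib import Path

# Path produced by generate_annotations.py (see the Project Tree section).
path = Path("data/merged/train.json")

if path.exists():
    annotations = json.loads(path.read_text())
else:
    # Illustrative stand-in mirroring the documented format, so the
    # sketch also runs before the annotations have been generated.
    annotations = {
        "objects": {"PMCXXXXXXX_000XX.pdf": [[0, [157, 241, 807, 738], 1]]},
        "tokens": {"PMCXXXXXXX_000XX.pdf": [[0, [179, 241, 344, 271], "Unfortunately,", 1, 0]]},
        "links": {"PMCXXXXXXX_000XX.pdf": []},
    }

# Each top-level key maps page file names to lists of annotations.
for page, objects in annotations["objects"].items():
    print(page, "->", len(objects), "objects")
```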
```
├── setup.py            - initialization script
├── visualization.ipynb - visualize annotations on example images
│
├── src/
│   ├── generate_annotations.py - annotation JSON files generation
│   └── data/                   - scripts used by generate_annotations.py
│
├── data/ - where papers and annotations are stored
│   ├── publaynet/    - train, val, test JSONs and PubLayNet_PDF folder
│   ├── pubtables-1m/ - PubTables-1M-PDF_Annotations_JSON folder
│   └── merged/
│       ├── test.json  - CTE annotations (as described in the config file format section)
│       ├── train.json - CTE annotations (as described in the config file format section)
│       └── val.json   - CTE annotations (as described in the config file format section)
```
Config files are in .json format. Example:
"objects":
{
"PMC#######_000##.pdf":
[
[0, [157, 241, 807, 738], 1],
[1, [157, 741, 807, 1238], 1],
...
]
...
},
"tokens":
{
"PMC#######_000##.pdf":
[
[0, [179, 241, 344, 271], 'Unfortunately,', 1, 0],
[1, [354, 241, 412, 271], 'these', 1, 0],
[2, [423, 241, 604, 271], 'quality-adjusted', 1, 0],
...
]
...
}
"links":
{
"PMC#######_000##.pdf":
[
...,
[9, 11, [31, 41]],
[10, 12, [22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
...
]
...
}
Each object has the following information:
- object id
- bounding box coordinates
- class id

Each token has the following information:
- token id
- bounding box coordinates
- text
- class id
- object id (of the object it belongs to)

Each link has the following information:
- link id
- class id
- token ids (list of tokens linked together)
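As a sketch of how these fields fit together, the tokens of a page can be grouped by their object id (the last field of each token entry) to reconstruct the text of each region. The entries below are the illustrative ones from the example above:

```python
from collections import defaultdict

# Token entries as described above:
# [token id, bounding box, text, class id, object id]
tokens = [
    [0, [179, 241, 344, 271], "Unfortunately,", 1, 0],
    [1, [354, 241, 412, 271], "these", 1, 0],
    [2, [423, 241, 604, 271], "quality-adjusted", 1, 0],
]

# Group token texts by the object they belong to.
text_by_object = defaultdict(list)
for token_id, bbox, text, class_id, object_id in tokens:
    text_by_object[object_id].append((token_id, text))

for object_id, entries in sorted(text_by_object.items()):
    # Sort by token id to preserve reading order.
    words = [text for _, text in sorted(entries)]
    print(f"object {object_id}: {' '.join(words)}")
# prints: object 0: Unfortunately, these quality-adjusted
```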
If you want to use our dataset in your project[^1], please cite us:
```bibtex
@inproceedings{cte-2022,
    title    = "CTE: Contextualized Table Extraction Dataset",
    author   = "Gemelli, Andrea and Vivoli, Emanuele and Marinai, Simone",
    year     = "2022",
    abstract = "Important information within a document are usually organized in tables, helping the reader with information retrieval and comparison. Most benchmark datasets support either document layout analysis or table understanding, but lack in providing data to apply both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), that aims to extract and structure tables considering the textual context of the document. The dataset comprises 75k fully annotated pages of scientific papers, with more than 35k tables and 13 classes. Data are gathered from PubMed Central, merging PubTables-1M and PubLayNet annotations in support of CTE, with the addition of new classes. Our proposed annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis and table detection, structure recognition and functional analysis. We formally define CTE and evaluation metrics, showing which subtasks can be tackled in its support, describing advantages, limitations and future works of this collection of data. Annotations and code will be accessible at https://github.com/AILab-UniFI/cte-dataset"
}
```
Footnotes

[^1]: Xu Zhong et al., "PubLayNet: largest dataset ever for document layout analysis", ICDAR 2019.
[^2]: B. Smock et al., "Towards a universal dataset and metrics for training and evaluating table extraction models", arXiv, November 2021.
[^3]: Li, Minghao, et al., "DocBank: A benchmark dataset for document layout analysis", arXiv preprint arXiv:2006.01038 (2020).
[^4]: Kardas, Marcin, et al., "AxCell: Automatic extraction of results from machine learning papers", arXiv preprint arXiv:2004.14356 (2020).