We have built new annotations for the Contextualized Table Extraction (CTE) task by fusing two well-known datasets:
- PubLayNet[^1], a dataset for Document Layout Analysis with 5 different labeled region types;
- PubTables-1M[^2], a dataset for Table Detection, Table Structure Recognition and Functional Analysis.
Tables are important sources of information for research purposes, and giving them a context (instead of focusing on them in isolation) can help in their extraction. We were inspired mainly by two works:
- DocBank[^3], to reformulate the problem as a token-classification task;
- AxCell[^4], to give tables a context, also for comparing research results.
You can read more details in our paper: CTE: Contextualized Table Extraction Dataset (under review).
About the PDF data: we do not own the copyright of the original data, and we cannot redistribute them. The PDF files can be downloaded from here.
Run in your environment:

```shell
pip install -e .
```

to install the dependencies.
After that, download:
- the PubLayNet annotations from here
- the PubTables-1M-PDF_Annotations_JSON archive from here

and place them as described in the Project Tree section.
Finally, to generate the annotations, run:

```shell
python src/generate_annotations.py
```
You will find the train, val and test annotation JSON files in the data/merged subfolder.
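Once generated, each split can be loaded with Python's standard `json` module. A minimal sketch follows; the fallback sample (used when the file has not been generated yet) mirrors the documented annotation format, and its page file name and values are purely illustrative:

```python
import json
from pathlib import Path

# Path produced by generate_annotations.py (see the Project Tree section).
path = Path("data/merged/train.json")

if path.exists():
    annotations = json.loads(path.read_text())
else:
    # Illustrative stand-in mirroring the documented format, so the
    # sketch also runs before the annotations have been generated.
    annotations = {
        "objects": {"PMCXXXXXXX_000XX.pdf": [[0, [157, 241, 807, 738], 1]]},
        "tokens": {"PMCXXXXXXX_000XX.pdf": [[0, [179, 241, 344, 271], "Unfortunately,", 1, 0]]},
        "links": {"PMCXXXXXXX_000XX.pdf": []},
    }

# Each top-level key maps page file names to lists of annotations.
for page, objects in annotations["objects"].items():
    print(page, "->", len(objects), "objects")
```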
```
├── setup.py            - initialization script
├── visualization.ipynb - visualize annotations on example images
│
├── src/
│   ├── generate_annotations.py - annotation JSON files generation
│   └── data/                   - scripts used by generate_annotations.py
│
├── data/ - where papers and annotations are stored
│   ├── publaynet/    - train, val, test JSONs and PubLayNet_PDF folder
│   ├── pubtables-1m/ - PubTables-1M-PDF_Annotations_JSON folder
│   └── merged/
│       ├── test.json  - CTE annotations (as described in the config file format section)
│       ├── train.json - CTE annotations (as described in the config file format section)
│       └── val.json   - CTE annotations (as described in the config file format section)
```
Config files are in .json format. Example:
"objects":
{
"PMC#######_000##.pdf":
[
[0, [157, 241, 807, 738], 1],
[1, [157, 741, 807, 1238], 1],
...
]
...
},
"tokens":
{
"PMC#######_000##.pdf":
[
[0, [179, 241, 344, 271], 'Unfortunately,', 1, 0],
[1, [354, 241, 412, 271], 'these', 1, 0],
[2, [423, 241, 604, 271], 'quality-adjusted', 1, 0],
...
]
...
}
"links":
{
"PMC#######_000##.pdf":
[
...,
[9, 11, [31, 41]],
[10, 12, [22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
...
]
...
}
Each object has the following information:
- object id
- bounding box coordinates
- class id

Each token has the following information:
- token id
- bounding box coordinates
- text
- class id
- object id (of the object it belongs to)

Each link has the following information:
- link id
- class id
- token ids (list of tokens linked together)
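As a sketch of how these fields fit together, the tokens of a page can be grouped by their object id (the last field of each token entry) to reconstruct the text of each region. The entries below are the illustrative ones from the example above:

```python
from collections import defaultdict

# Token entries as described above:
# [token id, bounding box, text, class id, object id]
tokens = [
    [0, [179, 241, 344, 271], "Unfortunately,", 1, 0],
    [1, [354, 241, 412, 271], "these", 1, 0],
    [2, [423, 241, 604, 271], "quality-adjusted", 1, 0],
]

# Group token texts by the object they belong to.
text_by_object = defaultdict(list)
for token_id, bbox, text, class_id, object_id in tokens:
    text_by_object[object_id].append((token_id, text))

for object_id, entries in sorted(text_by_object.items()):
    # Sort by token id to preserve reading order.
    words = [text for _, text in sorted(entries)]
    print(f"object {object_id}: {' '.join(words)}")
# prints: object 0: Unfortunately, these quality-adjusted
```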
If you want to use our dataset in your project[^1], please cite us:
```bibtex
@inproceedings{cte-2022,
    title    = "CTE: Contextualized Table Extraction Dataset",
    author   = "Gemelli, Andrea and Vivoli, Emanuele and Marinai, Simone",
    year     = "2022",
    abstract = "Important information within a document are usually organized in tables, helping the reader with information retrieval and comparison. Most benchmark datasets support either document layout analysis or table understanding, but lack in providing data to apply both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), that aims to extract and structure tables considering the textual context of the document. The dataset comprises 75k fully annotated pages of scientific papers, with more than 35k tables and 13 classes. Data are gathered from PubMed Central, merging PubTables-1M and PubLayNet annotations in support of CTE, with the addition of new classes. Our proposed annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis and table detection, structure recognition and functional analysis. We formally define CTE and evaluation metrics, showing which subtasks can be tackled in its support, describing advantages, limitations and future works of this collection of data. Annotations and code will be accessible at https://github.com/AILab-UniFI/cte-dataset"
}
```
Footnotes

[^1]: Xu Zhong et al., "PubLayNet: largest dataset ever for document layout analysis", ICDAR 2019.
[^2]: B. Smock et al., "Towards a universal dataset and metrics for training and evaluating table extraction models", arXiv, November 2021.
[^3]: Li, Minghao, et al., "DocBank: A benchmark dataset for document layout analysis", arXiv preprint arXiv:2006.01038 (2020).
[^4]: Kardas, Marcin, et al., "AxCell: Automatic extraction of results from machine learning papers", arXiv preprint arXiv:2004.14356 (2020).