Contextualized Table Extraction Dataset

The CTE Dataset

We have built new annotations for the Contextualized Table Extraction task by fusing two well-known datasets:

  • PubLayNet [1], a dataset for Document Layout Analysis with 5 different labeled regions;
  • PubTables-1M [2], a dataset for Table Detection, Table Structure Recognition and Functional Analysis.

Tables are important sources of information for research purposes, and giving them a context (instead of focusing on them in isolation) can help in their extraction. We have been mainly inspired by two works:

  • DocBank [3], to reformulate the problem as a token-classification task;
  • AxCell [4], to give tables a context, also for research comparison purposes.

You can read more details in our paper: CTE: Contextualized Table Extraction Dataset (under review)

About the PDF data: We do not own the copyright of the original data and we cannot redistribute them. The PDF files can be downloaded from here.


Generate CTE annotations

Run in your environment:

pip install -e .

to install dependencies.

After that, download:

  1. PubLayNet annotations from here;
  2. PubTables-1M-PDF_Annotations_JSON from here;

and place them as described in the Project Tree section (a quick placement check is sketched below).
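A minimal placement check, not part of the repository: it only verifies that the two downloaded folders sit where the Project Tree section says they should be, using the paths listed there.

# Hypothetical sanity check: confirm the downloaded annotations are in place.
from pathlib import Path

expected = [
    Path("data/publaynet"),                                       # PubLayNet train/val/test jsons and PubLayNet_PDF
    Path("data/pubtables-1m/PubTables-1M-PDF_Annotations_JSON"),  # PubTables-1M annotation jsons
]

for path in expected:
    print(("ok      " if path.exists() else "MISSING ") + str(path))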

Finally, to generate the annotations, run:

python src/generate_annotations.py

You will find the train, val and test annotation JSON files in the data/merged subfolder.
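A quick way to confirm that generation succeeded is sketched below; it is not part of the repository. It simply loads each merged file and reports how many annotated pages (one entry per PDF page file) it contains, following the format described in the Config File Format section.

# Hypothetical check of the generated CTE annotation files.
import json
from pathlib import Path

for split in ("train", "val", "test"):
    path = Path("data/merged") / f"{split}.json"
    with path.open() as f:
        annotations = json.load(f)
    # "objects" maps each PDF page file name to its annotated regions
    # (see the Config File Format section).
    print(f"{split}: {len(annotations['objects'])} annotated pages")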

Project Tree

  ├── setup.py - Initialization script
  ├── visualization.ipynb - Visualize annotations on example images
  │
  ├── src/
  │   ├── generate_annotations.py - Annotation json files generation 
  │   └── data/ - folder of scripts used by generate_annotations.py
  │
  ├── data/ - where papers and annotations are stored
  │   ├── publaynet/ - train, val, test jsons and PubLayNet_PDF folder
  │   ├── pubtables-1m/ - PubTables-1M-PDF_Annotations_JSON folder
  │   └── merged/
  │       ├── test.json - CTE annotations (as described in the Config File Format section)
  │       ├── train.json - CTE annotations (as described in the Config File Format section)
  │       └── val.json - CTE annotations (as described in the Config File Format section)

Config File Format

Config files are in .json format. Example:

  "objects": 
      {
        "PMC#######_000##.pdf": 
          [
            [0, [157, 241, 807, 738], 1],
            [1, [157, 741, 807, 1238], 1],
            ...
          ]
        ...
      },
  "tokens":
      {
        "PMC#######_000##.pdf":
          [
            [0, [179, 241, 344, 271], "Unfortunately,", 1, 0],
            [1, [354, 241, 412, 271], "these", 1, 0],
            [2, [423, 241, 604, 271], "quality-adjusted", 1, 0],
            ...
          ]
        ...
      }
  "links":
      {
        "PMC#######_000##.pdf":
          [
            ...,
            [9, 11, [31, 41]],
            [10, 12, [22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
            ...
          ]
        ...
      }

Each object has the following information:

  • object id
  • bounding box coordinates
  • class id

Each token has the following information:

  • token id
  • bounding box coordinates
  • text
  • class id
  • object id (to which it belongs)

Each link has the following information:

  • link id
  • class id
  • token ids (list of tokens linked together)
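
A minimal parsing sketch, assuming the field order listed above; the file name and variable names below are illustrative, not part of the repository.

# Hypothetical walk over one annotated page of a merged CTE file.
import json

with open("data/merged/val.json") as f:
    data = json.load(f)

for page, objects in data["objects"].items():
    tokens = data["tokens"].get(page, [])
    links = data["links"].get(page, [])

    for obj_id, bbox, class_id in objects:
        # Tokens store the id of the object they belong to (last field),
        # so region-level and token-level annotations can be joined directly.
        text = " ".join(t[2] for t in tokens if t[4] == obj_id)
        print(page, obj_id, class_id, bbox, text[:60])

    for link_id, link_class, token_ids in links:
        # Each link groups the ids of tokens that belong together.
        print(page, "link", link_id, link_class, token_ids)

    break  # inspect a single page only

Because every token stores the id of its parent object, and every link stores the ids of the tokens it groups, the three annotation levels can be combined without any extra lookup tables.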

Cite this project

If you want to use our dataset in your project [1], please cite us:

@inproceedings{cte-2022,
    title = "CTE: Contextualized Table Extraction Dataset",
    author = "Gemelli, Andrea and Vivoli, Emanuele and Marinai, Simone",
    year = "2022",
    abstract = "Important information within a document are usually organized in tables, helping the reader with information retrieval and comparison. Most benchmark datasets support either document layout analysis or table understanding, but lack in providing data to apply both tasks in a unified way. We define the task of Contextualized Table Extraction (CTE), that aims to extract and structure tables considering the textual context of the document. The dataset comprises 75k fully annotated pages of scientific papers, with more than 35k tables and 13 classes. Data are gathered from PubMed Central, merging PubTables-1M and PubLayNet annotations in support of CTE, with the addition of new classes. Our proposed annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis and table detection, structure recognition and functional analysis. We formally define CTE and evaluation metrics, showing which subtasks can be tackle in its support, describing advantages, limitations and future works of this collection of data. Annotations and code will be accessible at https://github.com/AILab-UniFI/cte-dataset",
}

Footnotes

  1. Zhong, Xu, et al. "PubLayNet: largest dataset ever for document layout analysis." ICDAR 2019.

  2. Smock, B., et al. "Towards a universal dataset and metrics for training and evaluating table extraction models." arXiv, November 2021.

  3. Li, Minghao, et al. "DocBank: A benchmark dataset for document layout analysis." arXiv preprint arXiv:2006.01038 (2020).

  4. Kardas, Marcin, et al. "AxCell: Automatic extraction of results from machine learning papers." arXiv preprint arXiv:2004.14356 (2020).
