Table of contents:
- 1. Dataset Description
- 2. Considered Table Types and Header Cells
- 3. Table Storage and Annotation
- 4. Sample Format
- 5. Leader Board
- 6. Model Training and Evaluation
- 7. Limitations
## 1. Dataset Description

IM-TQA is a Chinese table question answering dataset with 1,200 tables and 5,000 question-answer pairs, which highlights Implicit and Multi-type table structures for real-world TQA scenarios. It yields a more challenging table QA setting with two characteristics:
- Models need to handle different types of tables (i.e., Multi-type).
- Header cell annotations are not provided to models directly (i.e., Implicit).
By contrast, previous TQA benchmarks mainly focus on limited table types with explicit table structures (i.e., the model knows exactly which cells are headers). We collect multi-type tables and ask professional annotators to provide the following annotations: (1) table types, (2) header cell locations, (3) natural language look-up questions together with (4) their answer cell locations. More details, analyses, and baseline results can be found in the paper.
## 2. Considered Table Types and Header Cells

As shown in Figure 1, we divide tables into 4 types according to their structural characteristics. This taxonomy is in line with previous works, with the complex table type as an important complement. Exploring and including more table types deserves future investigation.
- Vertical Table: Table data is arranged in the vertical direction, with the first row as column headers and other rows as data tuples.
- Horizontal Table: Table data is arranged in the horizontal direction, with the first column as row headers and other columns as data tuples.
- Hierarchical Table: Table data is arranged in both vertical and horizontal directions, with headers exhibiting a multi-level hierarchical structure.
- Complex Table: In tables of the above 3 types, header cells are located only at the top or on the left side of the table. In complex tables, by contrast, headers can also appear at other positions, such as the bottom-right region of the table, and can be mixed with data cells. Such tabular structures with flexible header locations often appear in professional equipment specifications and record sheets, presenting a great challenge to existing methods.
To promote the understanding of implicit table structures, we categorize table cells into 5 types based on their functional roles, with a focus on the header cells that help TQA models locate correct answer cells.
- Row Attribute and Column Attribute: Row attributes and column attributes are traditional table headers which describe other cells in the same row and in the same column, respectively (e.g., the yellow and red cells in Figure 1). Attribute cells only serve to describe other cells; they are not meaningful data themselves.
- Row Index and Column Index: Row indexes and column indexes are individual cells that are used to index data records in the row or column orientation (e.g., the blue and green cells in Figure 1). Unlike attribute cells, index cells are also meaningful data in their own right. For instance, in vertical tables, data cells in the primary key column are unique identifiers of each row.
- Pure Data: Pure data cells are the core body of a table. They do not have the function of describing or indexing other cells, and their meaning should be understood with the help of the above header cells.
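As a concrete illustration, the header annotations of the sample table "Z56mZoK9" from the Sample Format section below can be read through this taxonomy (cell ids refer to its cell_ID_matrix; any cell absent from all four lists is a pure data cell):

```python
# Functional roles of cells in table "Z56mZoK9" (see the Sample Format section):
header_annotation = {
    "column_attribute": [0, 1, 2, 3, 4],  # the first row describes each column
    "row_attribute": [],                  # this table has no row attributes
    "column_index": [],                   # ... and no column indexes
    "row_index": [5],                     # cell 5 indexes its data record (row)
}
# Every cell id absent from all four lists is a pure data cell.
```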
## 3. Table Storage and Annotation

To store various tables in a unified way, we design a storage method which separately stores cell positions (a cell ID matrix recording the table layout) and cell contents (a cell value list indexed by cell ID); see the sample format below.
The IM-TQA dataset consists of six .json files for train/dev/test samples in the `data` directory. `train_tables.json`, `dev_tables.json`, and `test_tables.json` store table data and annotated header cells, while `train_questions.json`, `dev_questions.json`, and `test_questions.json` store question-answer pairs. Table samples and question-answer pairs are dictionary objects. Though IM-TQA is collected from Chinese tables, we adopt a commercial machine translation model to translate its tables and questions from Chinese into English. Note, however, that we did not double-check the translation results, so the translation quality may be poor.
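As a minimal loading sketch (assuming each file holds a JSON list of the dictionary objects shown in the next section, and that the files sit in the `data` directory as listed above):

```python
import json
import os

def load_split(split, data_dir="data"):
    """Load tables and question-answer pairs for one split: 'train', 'dev' or 'test'."""
    with open(os.path.join(data_dir, f"{split}_tables.json"), encoding="utf-8") as f:
        tables = json.load(f)
    with open(os.path.join(data_dir, f"{split}_questions.json"), encoding="utf-8") as f:
        questions = json.load(f)
    # Index tables by id so each question can find its table via "table_id".
    return {t["table_id"]: t for t in tables}, questions

tables, questions = load_split("train")
print(f"{len(tables)} tables, {len(questions)} questions")
```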
## 4. Sample Format

Table sample format:

```python
{
  "table_id": "Z56mZoK9",            # unique table id
  "table_type": "vertical",          # table type; possible values: 'vertical', 'horizontal', 'hierarchical' or 'complex'
  "file_name": "垂直表格_216",        # Chinese table file name
  "cell_ID_matrix": [[0,1,2,3],      # stores the table layout as cell ID lists in row-first order, e.g., [0,1,2,3] represents the first row
                     [4,5,6,7],
                     ...],
  "chinese_cell_value_list": ["序号", "客户", "销售金额", "年度销售占比%", "是否存在关联关系", ...],  # stores cell content, indexed by the cell IDs in cell_ID_matrix
  "english_cell_value_list": ["Serial No", "customer", "sales amount", "Proportion of annual sales%", ...],  # the same cell value list translated into English
  "column_attribute": [0,1,2,3,4],   # annotated cell ID lists of the 4 header cell types
  "row_attribute": [],
  "column_index": [],
  "row_index": [5]
}
```
Question-answer pair sample format:

```python
{
  "table_id": "Z56mZoK9",            # table_id is used to index the related table of each question
  "question_id": "Z56mZoK9_3",       # unique question id
  "file_name": "垂直表格_216",        # Chinese table file name
  "chinese_question": "客户一的销售金额是多少?年度销售占比是多少?",  # question text raised by annotators
  "english_question": "What is the sales amount of Customer 1? What is the percentage of annual sales?",  # English question text
  "answer_cell_list": [7, 8],        # cell ID list of answer cells
  "question_type": "arbitrary_cells" # question type; possible values: 'single_cell', 'one_row', 'one_col' and 'arbitrary_cells'
}
```
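Putting the two formats together, the answer text of a question can be recovered by mapping its `answer_cell_list` through the table's cell value lists; a small sketch under the same assumptions:

```python
def answer_texts(table, qa, lang="english"):
    """Map a question's answer cell ids to cell strings via the table's value list."""
    values = table[f"{lang}_cell_value_list"]  # cell ids index this list directly
    return [values[cell_id] for cell_id in qa["answer_cell_list"]]

# For the sample pair above, this returns the contents of cells 7 and 8,
# i.e., Customer 1's sales amount and its proportion of annual sales.
```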
The dataset split statistics are shown below:

| Split | Train | Valid | Test | Total |
| --- | --- | --- | --- | --- |
| # tables | 936 | 111 | 153 | 1200 |
| # questions | 3909 | 464 | 627 | 5000 |
| # vertical tables | 224 | 31 | 45 | 300 |
| # horizontal tables | 230 | 34 | 36 | 300 |
| # hierarchical tables | 231 | 35 | 34 | 300 |
| # complex tables | 251 | 11 | 38 | 300 |
## 5. Leader Board

We evaluate traditional TQA methods and recent powerful large language models (LLMs) such as ChatGPT. (The LLMs' output files are stored in the `llm_outputs` directory.) From the results shown below, we can see that ChatGPT performs quite well in handling look-up questions which select specific table cells as answers. This also suggests that more complicated questions are needed for a comprehensive evaluation of LLMs' table understanding ability. Some recent studies have made promising progress towards this goal, e.g., [1], [2].
All numbers are exact match accuracy (%):

| Model | All Tables | Vertical | Horizontal | Hierarchical | Complex |
| --- | --- | --- | --- | --- | --- |
| Ernie-Layout | 11.6 | 11.5 | 4.10 | 5.66 | 22.6 |
| TAPEX | 13.1 | 14.9 | 10.7 | 8.18 | 17.4 |
| RAT | 18.5 | 34.5 | 33.6 | 5.03 | 4.07 |
| TAPAS | 33.2 | 58.0 | 31.1 | 26.4 | 15.7 |
| RCI | 47.2 | 68.4 | 45.1 | 56.0 | 19.2 |
| RCI-AIT | 49.6 | 69.5 | 43.4 | 60.4 | 23.8 |
| RGCN-RCI | 53.4 | 70.7 | 45.9 | 62.9 | 32.0 |
| ChatGPT (zero-shot) | 92.3 | 93.1 | 92.6 | 91.2 | 92.2 |
| Human | 95.1 | 96.6 | 95.1 | 94.3 | 94.1 |
## 6. Model Training and Evaluation

We use PaddlePaddle to implement our model, and all experiments were conducted on an NVIDIA TITAN RTX 24GB GPU. Configuring the experiment environment with PaddlePaddle may run into some problems; we suggest looking for solutions in the official PaddlePaddle GitHub issues. The trained RGCN and RCI model weights can be downloaded from Google Drive.
```bash
conda create -n IM_TQA python=3.7
conda activate IM_TQA
pip install -r requirements.txt
```
The `init_embedding_model` argument is the name of the model used to encode cell text into 768-dim semantic features. It is passed to `model.from_pretrained()`, and you can change the code to set it to the local path of a pre-downloaded model. The resulting PGL graph objects will be saved as pickle files (.pkl).
```bash
cd CTC_code
python convert_tables_to_graphs.py \
    --tables_dir='../data/' \
    --saved_graphs_dir='../data/' \
    --init_embedding_model='bert-base-chinese'
# or you can directly run: sh build_graphs_based_on_tables.sh
```
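For reference, encoding a cell string into a 768-dim vector can be sketched as follows. The repo does this with PaddlePaddle/PGL; this illustration uses the HuggingFace transformers API instead, so the exact calls in `convert_tables_to_graphs.py` will differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")
model.eval()

def encode_cell(text: str) -> torch.Tensor:
    """Encode one cell string into a 768-dim semantic feature ([CLS] hidden state)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # shape: (768,)

print(encode_cell("销售金额").shape)  # torch.Size([768])
```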
The auto encoder is used to convert discrete 24-dim manual features into continuous 32-dim features. The resulting 32-dim cell features of each table will also be saved as pickle files (.pkl).
```bash
CUDA_VISIBLE_DEVICES=0 nohup python train_auto_encoder.py \
    --run_num=1 \
    --enc_hidden_dim=32 \
    --manual_feat_dim=24 \
    --random_seed=12345 \
    --data_dir='../data/' \
    --feats_save_dir='../data/' \
    --model_save_dir='./saved_models/ctc_auto_encoder/' > ./log_files/train_auto_encoder_to_encode_manual_cell_feats.log &
# or you can directly run: sh train_auto_encoder.sh
```
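The shape of such an auto encoder can be sketched as below. The actual layers and training loop live in `train_auto_encoder.py` (implemented in PaddlePaddle), so treat this PyTorch sketch as an illustration matching the `--manual_feat_dim=24` and `--enc_hidden_dim=32` flags above, not the repo's implementation:

```python
import torch
from torch import nn

class CellFeatAutoEncoder(nn.Module):
    """Map 24-dim discrete manual cell features to 32-dim continuous ones."""
    def __init__(self, manual_feat_dim=24, enc_hidden_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(manual_feat_dim, enc_hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(enc_hidden_dim, manual_feat_dim)

    def forward(self, x):
        z = self.encoder(x)        # the 32-dim features kept for the table graphs
        return self.decoder(z), z  # the reconstruction is trained against x

model = CellFeatAutoEncoder()
x = torch.rand(8, 24)                    # a batch of 8 cells' manual features
recon, feats = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction objective (assumed)
print(feats.shape)                       # torch.Size([8, 32])
```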
```bash
python3 add_manual_feats_to_table_graphs.py
```

Make sure the data paths in `add_manual_feats_to_table_graphs.py` are correct. The resulting heterogeneous graphs with node features of two types will be saved as pickle files (.pkl).
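Conceptually, after this step each cell node carries two feature types. Whether they are concatenated or kept as separate node-feature fields in the PGL heterogeneous graph is an implementation detail of the repo; the following only illustrates the shapes:

```python
import numpy as np

n_cells = 20                                # e.g., a 4x5 table
bert_feats = np.random.rand(n_cells, 768)   # semantic features from convert_tables_to_graphs.py
manual_feats = np.random.rand(n_cells, 32)  # encoded manual features from train_auto_encoder.py
# One possible choice is simple concatenation into a single node feature matrix:
node_feats = np.concatenate([bert_feats, manual_feats], axis=1)
print(node_feats.shape)                     # (20, 800)
```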
This script trains an R-GCN model for the CTC task using the constructed heterogeneous graphs of the train split. It saves the best CTC model based on performance on the validation split, and the predicted CTC results for tables of each split are saved for the subsequent table question answering (TQA) step. You can also save the model of each epoch and select the best model based on your own metric.
```bash
sh train_ctc_gnn.sh
```
The implementation of the TQA model is adapted from the codebase of the original RCI model, which uses PyTorch.

First `cd TQA_code` and construct row and column representations of the train and test splits using `build_RCI_train_and_test_data.ipynb`. Put the resulting files in `TQA_code/datasets/IM_TQA/`; there should be 4 files (i.e., `train_cols.jsonl.gz`, `train_rows.jsonl.gz`, `test_cols.jsonl.gz` and `test_rows.jsonl.gz`).
The training task is a 2-class sentence-pair classification task. Given a row or column representation and an input question, the bert-base-chinese model learns to predict whether this row or column contains the final answer cell(s). The trained models will be saved at `./datasets/IM_TQA/bert-base-chinese-epoch3-warmup0.1/col_bert_base` and `./datasets/IM_TQA/bert-base-chinese-epoch3-warmup0.1/row_bert_base`, respectively.
```bash
sh train_RCI_bert.sh
```
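To make the sentence-pair setup concrete, here is a hedged sketch of scoring one (question, row) pair with bert-base-chinese via the HuggingFace transformers API. The row serialization string below is hypothetical; the actual inputs are built in `build_RCI_train_and_test_data.ipynb`:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
model.eval()

question = "客户一的销售金额是多少?"                  # "What is the sales amount of Customer 1?"
row_repr = "序号 : 1 | 客户 : 客户一 | 销售金额 : ..."  # hypothetical row serialization

# Sentence-pair input: [CLS] question [SEP] row representation [SEP]
inputs = tokenizer(question, row_repr, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
prob_has_answer = torch.softmax(logits, dim=-1)[0, 1].item()  # P(row contains answer)
```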
In this step, the trained row and column models predict whether each row or column contains the answer cells. The inference results will be saved at `./datasets/IM_TQA/apply_bert/col_bert/results0.jsonl.gz` and `./datasets/IM_TQA/apply_bert/row_bert/results0.jsonl.gz`.
```bash
sh apply_RCI.sh
```
Based on the positive row ids and column ids, the predicted answer cell ids are extracted (i.e., `cell_ID_matrix[row_id][col_id]`) and compared with the gold answer cell ids to compute exact match scores. Make sure the related file paths in `compute_RCI_exact_match.py` are correct (lines 36-47). The predicted results of one run will be saved at `./datasets/IM_TQA/RGCN-RCI_test_pred_results.pkl`.
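As a sketch of that extraction and comparison (assuming the positive row and column ids have already been read from the inference files above):

```python
def predicted_answer_cells(cell_ID_matrix, pos_row_ids, pos_col_ids):
    """Intersect positive rows and columns to obtain predicted answer cell ids."""
    return {cell_ID_matrix[r][c] for r in pos_row_ids for c in pos_col_ids}

def exact_match(pred_cell_ids, gold_cell_ids):
    """A question counts as correct only if the two cell id sets match exactly."""
    return set(pred_cell_ids) == set(gold_cell_ids)
```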
```bash
python compute_RCI_exact_match.py
```
Since we have provided the results of Step 3 from one experiment, you can directly run the above command to validate them. This should give:
```
(1) report on all tables:
total exact match score: 0.5311004784688995
correct question num: 333
total question num: 627
--------------------
(2) report on complex tables:
exact match score on complex tables: 0.3023255813953488
correct question num on complex tables: 52
total question num on complex tables: 172
--------------------
(3) report on vertical tables:
exact match score on vertical tables: 0.7126436781609196
correct question num on vertical tables: 124
total question num on vertical tables: 174
--------------------
(4) report on horizontal tables:
exact match score on horizontal tables: 0.45901639344262296
correct question num on horizontal tables: 56
total question num on horizontal tables: 122
--------------------
(5) report on hierarchical tables:
exact match score on hierarchical tables: 0.6352201257861635
correct question num on hierarchical tables: 101
total question num on hierarchical tables: 159
```
## 7. Limitations

Though we have made a first exploration towards real-life TQA scenarios with implicit and multi-type tables, this work faces some limitations:
- Our proposed dataset is in Chinese and focuses on single tables. Though we translate the dataset from Chinese into English, we think it is better to directly construct a corresponding large-scale English TQA dataset in consideration of data quality. To build such a dataset with limited resources, one can fully utilize the abundant tables in existing English TQA datasets.
- This work focuses on look-up questions; more complicated questions which need multi-hop reasoning and numerical calculations are needed.
- Besides, more delicate taxonomies of table types and cell types also deserve future exploration.
If you find this work useful, please consider citing our paper:
```bibtex
@inproceedings{zheng-etal-2023-im,
    title = "{IM}-{TQA}: A {C}hinese Table Question Answering Dataset with Implicit and Multi-type Table Structures",
    author = "Zheng, Mingyu and
      Hao, Yang and
      Jiang, Wenbin and
      Lin, Zheng and
      Lyu, Yajuan and
      She, QiaoQiao and
      Wang, Weiping",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.278",
    doi = "10.18653/v1/2023.acl-long.278",
    pages = "5074--5094",
}
```
This dataset follows the Computational Use of Data Agreement v1.0.
Despite our best efforts, there may still be some errors in this dataset. If you have any questions regarding the IM-TQA dataset, please create an issue in this repository. You can also reach us via the e-mail addresses in the paper.