Although recent LLM-based table understanding methods have made great progress, they rely heavily on the premise that a given table must be converted into a text sequence (such as Markdown or HTML) to serve as model input. However, high-quality textual table representations are hard to obtain in some real-world scenarios, such as scanned documents and webpage screenshots, where table images are much more accessible. Therefore, how to understand tables directly from intuitive visual information is a crucial and urgent challenge for developing more practical applications.
To address this challenge, we propose the multimodal table understanding problem, where the model is required to generate correct responses to different table-related requests (e.g., questions) in an end-to-end fashion based on the table image. Correspondingly, we construct MMTab, the first open-source large-scale dataset for the multimodal table understanding problem, which supports both the training and evaluation of generalist MLLMs for multimodal table understanding. Based on the curated MMTab dataset, we develop a versatile tabular MLLM named Table-LLaVA with an enhanced two-stage training paradigm of LLaVA v1.5. Table-LLaVA beats strong MLLM baselines on 17 held-in and 6 held-out benchmarks, and is even competitive with the powerful GPT-4V on 14 benchmarks under a subset of test samples. The right figure shows an intuitive comparison of Table LLaVA 7B and existing MLLMs on various multimodal table understanding benchmarks.
We constructed MMTab based on 14 publicly available table datasets from 8 domains. We carefully designed scripts to convert the original textual tables in these datasets into table images covering a broad range of table structures and styles, and transformed all task-specific samples into multimodal instruction-tuning samples with a unified format of `<table image, input request, output response>`. The resulting dataset contains three parts and can be downloaded from the Hugging Face Dataset. During dataset construction, data augmentations at multiple levels (e.g., table-level, task-level) were adopted to further improve data diversity.
Dataset Split | #Table Images | #Samples |
---|---|---|
MMTab-pre | 97K | 150K table recognition samples for pre-training |
MMTab-instruct | 82K | 232K samples of 14 table-based tasks for instruction-tuning |
MMTab-eval | 23K | 45K samples of 17 held-in benchmarks and 4K samples of 7 held-out benchmarks for evaluation |
Dataset examples are shown in the following figure, and more examples can be found in Appendix A of the original paper.
Table LLaVA follows the LLaVA v1.5 architecture, with CLIP-ViT-L-336px as the visual encoder (336*336 image resolution), Vicuna-v1.5-7B or Vicuna-v1.5-13B as the base LLM and a two-layer MLP as the vision-language connector. The saved model checkpoints can be downloaded from the following Hugging Face Repository:
Version | Size | Schedule | Base LLM | Vision Encoder | Projection layer | Checkpoints |
---|---|---|---|---|---|---|
Table LLaVA | 7B | full_finetune-1_epoch | Vicuna-v1.5-7B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-7b |
Table LLaVA | 13B | full_finetune-1_epoch | Vicuna-v1.5-13B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-13b |
pretrained_mm_projector of Table LLaVA 7B | 5M | full_finetune-1_epoch | Vicuna-v1.5-7B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-pretrained_mm_projector |
pretrained_mm_projector of Table LLaVA 13B | 5M | full_finetune-1_epoch | Vicuna-v1.5-13B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-pretrained_mm_projector |
Note: The above Table-LLaVA checkpoints were saved from the original LLaVA repository and are not directly compatible with Transformers, i.e., they cannot be loaded directly with `LlavaForConditionalGeneration.from_pretrained('SpursgoZmy/table-llava-v1.5-7b')`. This problem is mentioned in this GitHub issue. I will try the provided conversion script to make the Table-LLaVA checkpoints compatible with Transformers and upload the new checkpoints to a new hub. For now, the checkpoints can probably only be loaded with the LLaVA repository like this, rather than directly from Hugging Face. Sorry for the inconvenience!
We use the code base of LLaVA v1.5 for model training and inference. Thus, Table LLaVA can be used like a normal LLaVA v1.5 model, and the environment can be installed in a similar way. Note that our code base was downloaded in December 2023 and may not be the latest. Please refer to the official LLaVA v1.5 GitHub repository for its latest updates.
- Clone this repository and navigate to Table-LLaVA folder
git clone https://github.com/SpursGoZmy/Table-LLaVA.git
cd Table-LLaVA
- Install Package
conda create -n table_llava python=3.10 -y
conda activate table_llava
pip install --upgrade pip # enable PEP 660 support
pip install -e .
Table LLaVA training consists of two stages: (1) Pre-training stage: the vision-language connector (a two-layer MLP) is trained to connect the frozen pretrained vision encoder (ViT) to the frozen LLM (Vicuna v1.5); (2) Instruction-tuning stage: the vision-language connector and the base LLM are trained to follow multimodal instructions.
The training data of each stage is shown below:
Training Stage | Data Description | Data Size | Hugging Face Dataset |
---|---|---|---|
Pre-training | 558K original LLaVA-1.5 pre-training data | 558K | blip_laion_cc_sbu_558k.json |
Pre-training | 150K table recognition data (MMTab-pre) | 150K | MMTab-pre_pretrain_data_llava_format_150K.json |
Instruction Fine-tuning | 665K original LLaVA-1.5 fine-tuning data | 665K | llava_v1_5_mix665k.json |
Instruction Fine-tuning | 232K multimodal instruction tuning data of 14 tabular tasks (MMTab-instruct) | 232K | MMTab-instruct_sft_data_llava_format_232K.json |
The merged pre-training and instruction fine-tuning data in the LLaVA data format can be found in the MMTab dataset, i.e., `enhanced_llava_pretrain_data_708K.json` and `enhanced_llava_sft_data_898K.json`, which can be directly used to train Table LLaVA.
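The merged files are already provided, so no merging is needed to reproduce training. For readers who want to build a similar mixture from their own data, the following is a minimal sketch, assuming each source file is a JSON list of samples in the LLaVA data format. The helper name `merge_llava_json`, the shuffling step and the seed are illustrative assumptions, not a description of how the released files were produced.

```python
import json
import random
from pathlib import Path


def merge_llava_json(input_paths, output_path, seed=0):
    """Concatenate several LLaVA-format JSON files (each a list of samples).

    Hypothetical helper: the shuffle and the fixed seed are assumptions,
    not the procedure used to build the released merged files.
    """
    merged = []
    for path in input_paths:
        with open(path, "r", encoding="utf-8") as f:
            samples = json.load(f)
        if not isinstance(samples, list):
            raise ValueError(f"{path} does not contain a JSON list")
        merged.extend(samples)

    # Shuffle so that samples from different sources are interleaved.
    random.Random(seed).shuffle(merged)

    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)
```

For example, `merge_llava_json(["blip_laion_cc_sbu_558k.json", "MMTab-pre_pretrain_data_llava_format_150K.json"], "my_pretrain_mix.json")` would write a combined file and return the number of samples in it.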
Table LLaVA was trained on 8 A800 GPUs with 80GB memory. We use a similar set of hyperparameters as LLaVA v1.5 except that we increased the max sequence length from 2048 to 2560 to accommodate longer text sequences. The hyperparameters used in pretraining and finetuning are provided below.
Stage | Trained Weights | Global Batch Size | Learning rate | Epochs | Max length | Weight decay | warmup ratio | Deepspeed Stage |
---|---|---|---|---|---|---|---|---|
Pre-training | vision-language connector | 256 | 1e-3 | 1 | 2560 | 0 | 0.03 | ZeRO-2 |
Instruction Fine-tuning | base LLM and vision-language connector | 128 | 2e-5 | 1 | 2048 | 0 | 0.03 | ZeRO-3 |
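When training on a different number of GPUs, the global batch size in the table can be kept constant by adjusting the per-device batch size or the number of gradient accumulation steps. The sketch below only illustrates that arithmetic (global batch size = per-device batch size × number of GPUs × gradient accumulation steps); the function name is hypothetical and the printed values are derived from the table, not quoted from the training scripts.

```python
def per_device_batch_size(global_batch_size, num_gpus, grad_accum_steps=1):
    """Return the per-GPU batch size that yields the given global batch size."""
    denom = num_gpus * grad_accum_steps
    if global_batch_size % denom != 0:
        raise ValueError(
            "global batch size must be divisible by num_gpus * grad_accum_steps"
        )
    return global_batch_size // denom


# Global batch sizes from the table above, on 8 GPUs without accumulation.
for stage, global_bs in [("pre-training", 256), ("instruction fine-tuning", 128)]:
    print(stage, per_device_batch_size(global_bs, num_gpus=8))
# prints "pre-training 32" and "instruction fine-tuning 16"
```

With 4 GPUs, for instance, 2 gradient accumulation steps would keep the same per-device value for the same global batch size.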
- Download the original images for LLaVA v1.5 pretraining, i.e., `images.zip` from here. Put it under `./LLaVA-Pretrain/images` and unzip it.
- Download `MMTab-instruct_table_images_82K.zip` and `MMTab-pre_table_images_part_2_16K.zip` from the MMTab dataset. Put them under `./LLaVA-Pretrain/images` and unzip them. Rename the `IID_train_image` dir to `table_pretrain_part_1`.
- Download `enhanced_llava_pretrain_data_708K.json` from the MMTab dataset to `./LLaVA-Pretrain`.
- The resulting data should be organized as follows:
LLaVA-Pretrain
├── images
│ ├── table_pretrain_part_1
│ ├── table_pretrain_part_2
│ ├── 00453
│ ├── 00019
│ ├── ...
│ └── 00095
└── enhanced_llava_pretrain_data_708K.json
- Training script with DeepSpeed ZeRO-2: `pretrain_table_llava.sh`. If you cannot automatically download the base Vicuna v1.5 and ViT models through Hugging Face, you can download them manually and set the corresponding command-line parameters (`model_name_or_path` and `vision_tower`) to the local model paths. Once pre-training is finished, the trained vision-language projector will be saved at the specified `output_dir`.
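Before launching the pre-training script, it can help to confirm that the data folder matches the layout shown above. The following is a minimal sketch of such a check; the helper `missing_entries` is hypothetical and not part of the repository, and it only tests the table image folders and the merged JSON file, not the individual numbered LLaVA image folders.

```python
from pathlib import Path

# Entries taken from the directory tree above; the numbered LLaVA image
# sub-folders (00453, 00019, ...) are not checked individually.
EXPECTED_PRETRAIN_ENTRIES = [
    "images/table_pretrain_part_1",
    "images/table_pretrain_part_2",
    "enhanced_llava_pretrain_data_708K.json",
]


def missing_entries(root, expected=EXPECTED_PRETRAIN_ENTRIES):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_entries("./LLaVA-Pretrain")` returns an empty list when all three entries are present, and otherwise the relative paths that still need to be created.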
- Create 5 new folders under `./LLaVA-Finetune/images` named `coco`, `gqa`, `ocr_vqa`, `textvqa` and `vg`, respectively. Follow the instructions from here to download the images from these 5 datasets for LLaVA v1.5 fine-tuning. Put the zip files in the corresponding folders and unzip them.
- Download `MMTab-instruct_table_images_82K.zip` from the MMTab dataset. Put it under `./LLaVA-Finetune/images/table_instructV` and unzip it. Rename the resulting `IID_train_image` dir to `images`.
- Download `enhanced_llava_sft_data_898K.json` from the MMTab dataset to `./LLaVA-Finetune`.
- The resulting data should be organized as follows:
LLaVA-Finetune
├── images
│ ├── coco
│ │ └── train2017
│ ├── gqa
│ │ └── images
│ ├── ocr_vqa
│ │ └── images
│ ├── textvqa
│ │ └── train_images
│ ├── vg
│ │ ├── VG_100K
│ │ └── VG_100K_2
│ ├── table_instructV
│ │ └── images
└── enhanced_llava_sft_data_898K.json
- Training script with DeepSpeed ZeRO-3: `continue_sft_table_llava.sh`. Set the `pretrain_mm_mlp_adapter` parameter to the path of your pre-trained vision-language projector, such as `./pretrained_mm_projector/llava-v1.5-7b-with-table-pretrain/mm_projector.bin`. The trained Table LLaVA model will be saved at the specified `output_dir`.
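Because the fine-tuning images come from several separate downloads, a missing or misnamed folder is easy to overlook until training fails. The sketch below scans a LLaVA-format JSON file and reports referenced images that are absent on disk. It assumes each sample stores its image path in an `image` field relative to the image folder, and that text-only samples omit that field; the helper name and the `limit` parameter are hypothetical.

```python
import json
from pathlib import Path


def find_missing_images(data_json, image_root, limit=20):
    """List up to `limit` image paths referenced in a LLaVA-format JSON
    file that are absent under image_root.

    Samples without an "image" key (text-only samples) are skipped.
    """
    with open(data_json, "r", encoding="utf-8") as f:
        samples = json.load(f)

    image_root = Path(image_root)
    missing = []
    for sample in samples:
        rel = sample.get("image")
        if rel is None:
            continue
        if not (image_root / rel).is_file():
            missing.append(rel)
            if len(missing) >= limit:
                break  # stop early; a few examples are enough to diagnose
    return missing
```

For example, `find_missing_images("./LLaVA-Finetune/enhanced_llava_sft_data_898K.json", "./LLaVA-Finetune/images")` should return an empty list once every dataset has been unzipped into the expected folder.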
The inference data should be stored in LLaVA's jsonl format. Each line in the input file corresponds to an input sample, which is a JSON string (generated by `json.dumps()`) of a Python dict. The sample format should look like:
{ "question_id": "TSD_test_item_17", # item_id
"image": "TABMWP_24663.jpg", # corresponding image file
"text": "This image displays a table. Could you provide me ...", # input text
"category": "TABMWP_for_TSD" # {dataset_name}_for_{task_type}, which can be used to separate data of different benchmarks.
}
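To prepare your own data in this format, each sample can be serialized with `json.dumps()` and written on its own line. The following is a minimal sketch; the helper `write_question_file`, the sample values and the check that all four fields are present are illustrative assumptions, not requirements confirmed by the inference script.

```python
import json


def write_question_file(samples, output_path):
    """Write one JSON object per line, using the fields described above."""
    required = ("question_id", "image", "text", "category")
    with open(output_path, "w", encoding="utf-8") as f:
        for sample in samples:
            missing = [key for key in required if key not in sample]
            if missing:
                raise KeyError(f"sample is missing fields: {missing}")
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


# Hypothetical sample; the id, file name and category are placeholders.
samples = [
    {
        "question_id": "my_item_0",
        "image": "my_table_0.jpg",  # file inside the image folder
        "text": "This image displays a table. How many rows does it have?",
        "category": "MyData_for_TQA",  # {dataset_name}_for_{task_type}
    }
]
```

Calling `write_question_file(samples, "my_questions.jsonl")` produces a file that can then be passed as the question file for inference.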
For inference on the MMTab-eval, download the 49K MMTab-eval test samples in the jsonl format (MMTab-eval_test_data_49K_llava_jsonl_format.jsonl) and its image files (MMTab-eval_table_images_23K.zip). Then create a folder named 'LLaVA-Inference' and organize the data as follows:
LLaVA-Inference
├── MMTab-eval_test_data_49K_llava_jsonl_format.jsonl
└── all_test_image
Inference on multi-GPU: `start_multicard_inference.sh`. You can also run inference on your own data. Remember to adjust parameters such as `question-file` (input file path) and `image-folder` (image folder path) in `table_llava_inference.sh`. The inference results (`merge.jsonl`) will be stored at the path given by the `answers-file` parameter, e.g., `./eval_results/answers/MMTab_eval/table-llava-v1.5-7b/merge.jsonl`.
With the official inference script, the inference result format in `merge.jsonl` should look like:
{ 'question_id': 'TABMWP_8', # item_id
'prompt': 'Problem: \nHannah baked cookies each day ...', # input_prompt
'text': 'Find the numbers in the table.\n\nSaturday: ...', # model_output
'answer_id': 'jELcxSPcXHBj3xvHfm5r8T', # answer_id
'model_id': 'table-llava-7b', # model_id
'category': 'TABMWP_for_TQA' # item category
}
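Since the `category` field follows the `{dataset_name}_for_{task_type}` pattern, the results of different benchmarks can be separated after inference. The sketch below groups the records of a results file by category. It assumes each line of `merge.jsonl` is a valid JSON object (the example above is displayed in Python dict style); both helper names are hypothetical.

```python
import json
from collections import defaultdict


def load_results_by_category(path):
    """Group inference results by their 'category' field."""
    grouped = defaultdict(list)
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            grouped[record["category"]].append(record)
    return dict(grouped)


def split_category(category):
    """Split '{dataset_name}_for_{task_type}' into (dataset_name, task_type).

    If the separator is absent, the whole string is returned as the
    dataset name with an empty task type.
    """
    if "_for_" not in category:
        return category, ""
    dataset_name, _, task_type = category.rpartition("_for_")
    return dataset_name, task_type
```

For instance, `split_category("TABMWP_for_TQA")` gives `("TABMWP", "TQA")`, which makes it straightforward to report results per dataset or per task type.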
The evaluation scripts are stored in the `MMTab-eval_evaluation` folder. First, `cd MMTab-eval_evaluation` and `pip install -r eval_requirements.txt` to install the packages needed for evaluation, such as 'Sacrebleu'. For the table recognition task, we use PubTabNet's TEDS computation script for evaluation. Then, download the MMTab-eval test data (MMTab-eval_test_data_49K.json) and test tables (MMTab-eval_test_tables_23K.json), and put them into the `MMTab-eval_evaluation` folder together with LLaVA's inference result (`merge.jsonl`). Use the MMTab_evaluation.ipynb notebook for automatic evaluation.
For the evaluation on the ToTTo test set, you need to organize the model output into a txt file and upload it to the official ToTTo leaderboard.
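A minimal sketch of writing such a txt file is shown below. It assumes the leaderboard expects one prediction per line in the order of the test samples, which should be checked against the official submission instructions. The helper name is hypothetical, and `records` is assumed to be a list of result dicts with a `text` field, as in `merge.jsonl`.

```python
def write_predictions_txt(records, output_path):
    """Write one prediction per line, keeping the order of `records`."""
    with open(output_path, "w", encoding="utf-8") as f:
        for record in records:
            # Collapse newlines and repeated whitespace so that each
            # prediction stays on a single line.
            f.write(" ".join(record["text"].split()) + "\n")
```

Collapsing whitespace is a design choice made here because a model output containing a line break would otherwise shift every later prediction by one line.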
- Multilingual and multi-table scenarios. The proposed MMTab dataset mainly focuses on single tables in English. Multi-table scenarios with broader language coverage should be considered.
- Table images in the wild. MMTab is based on tables from academic table datasets and contains diverse, high-quality table images rendered by automatic scripts. Nevertheless, table images in the wild can be of low quality, for instance blurred, handwritten or incomplete. To further bridge the gap between academic research and real application scenarios, more diversified table images from the wild could be collected in the future, and the corresponding instruction-following data needs to be constructed.
- Improving image resolution. The supported image resolution of LLaVA-1.5 is relatively low and may limit the upper bound of its capacity. Luckily, with the emergence of MLLMs that support higher and dynamic image resolutions (e.g., LLaVA-Next and Qwen-VL), more powerful tabular MLLMs can be developed with the collected data.
- Upload the MMTab dataset to Hugging Face.
- Upload the Table LLaVA 7B and 13B model weights to Hugging Face.
- The code for model training.
- The code for model inference.
- The code for evaluation.
- Make the Table-LLaVA checkpoints compatible with the Transformers package, i.e., so that they can be loaded directly with `LlavaForConditionalGeneration.from_pretrained('SpursgoZmy/table-llava-v1.5-7b')`. This problem is mentioned in this issue.
@misc{zheng2024multimodal,
title={Multimodal Table Understanding},
author={Mingyu Zheng and Xinwei Feng and Qingyi Si and Qiaoqiao She and Zheng Lin and Wenbin Jiang and Weiping Wang},
year={2024},
eprint={2406.08100},
archivePrefix={arXiv},
}