This repository contains the code, data and model for the paper titled "Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models".
- [2024-06-26] We released the Math-LLaVA checkpoints. The Math-LLaVA-13B model achieves 46.6% on the MathVista testmini subset, 38.3% on the MMMU validation set, and 15.69% on MATH-V.
- [2024-06-25] Released the paper, code, and the MathV360K dataset.
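Before the setup steps below, clone the repository (the URL is assumed to be the official GitHub repo; substitute your fork if you use one):

```bash
# Clone Math-LLaVA (official repo URL assumed)
git clone https://github.com/HZQ950419/Math-LLaVA.git
```

Then enter the directory and install the dependencies: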
cd Math-LLaVA
conda create -n math_llava python=3.10 -y
conda activate math_llava
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
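As a quick sanity check that the environment is usable, you can try importing the main packages (a minimal sketch; the `llava` package name is assumed from the LLaVA codebase this project builds on):

```bash
# Assumes the editable install registers the package as `llava`
python -c "import torch, llava, flash_attn; print(torch.__version__)"
```

If this prints a version without errors, the installation succeeded.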
"train_samples_all_tuning.json" corresponds to the annotations of qa pairs for finetuning. Download image dataset.
Place the data in the root directory or other directory. Data structure:
├── data_images/
│ ├── TabMWP/images/
│ ├── IconQA/images/
│ ├── ...
├── train_samples_all_tuning.json
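Before launching fine-tuning, it can help to verify that the annotation file and image folders sit where the structure above expects them (a minimal check; extend the folder list to the datasets you actually downloaded):

```bash
# Sanity-check the fine-tuning data layout (paths assume the structure above)
test -f train_samples_all_tuning.json && echo "annotations OK" || echo "annotations missing"
for d in data_images/TabMWP/images data_images/IconQA/images; do
    [ -d "$d" ] && echo "found $d" || echo "missing $d"
done
```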
Run fine-tuning with:
sh finetune_task.sh
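If training cannot find your data or base model, check the paths configured inside `finetune_task.sh`. The flag names in the pattern below are assumptions based on LLaVA-style training scripts; adjust them to whatever your copy of the script actually uses:

```bash
# Inspect which data/model paths finetune_task.sh passes to the trainer
# (flag names assumed from LLaVA-style scripts)
grep -nE "data_path|image_folder|model_name_or_path|output_dir" finetune_task.sh
```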
You can download and unzip the MathVista images using the following commands:
cd ./evaluation_mathvista/mathvista_data
wget https://huggingface.co/datasets/AI4Math/MathVista/resolve/main/images.zip
unzip images.zip
Generate responses on the testmini subset:
cd evaluation_mathvista
python response.py --output_dir ./mathvista_outputs --output_file responses.json --model_path your/model/path --model_base None
Extract the short answer text for score calculation with ChatGPT. This step requires an OpenAI API key.
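Assuming `extract_answer.py` reads the key from the `OPENAI_API_KEY` environment variable (check the script if it expects the key via a flag or a config file instead), export it first:

```bash
# Assumption: extract_answer.py reads the key from OPENAI_API_KEY
export OPENAI_API_KEY="your-api-key-here"
```

Then run the extraction: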
python extract_answer.py --output_file responses.json
Calculate the final score:
python calculate_score.py --output_file responses.json --score_file responses_score.json
Generate responses on the MMMU validation set:
cd eval_mmmu
python mmmu_response.py --output_path mmmu_eval_output.json --model_path your/model/path
Calculate the score:
python mmmu_only_eval.py --output_path mmmu_eval_output.json --answer_path ./answer_dict_val.json
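Putting the two MMMU steps together (as in the MathVista command above, `your/model/path` is a placeholder for your fine-tuned checkpoint; since `--output_path` is relative, the predictions file is written inside `eval_mmmu/`):

```bash
# End-to-end MMMU evaluation; replace the model path with your checkpoint
cd eval_mmmu
python mmmu_response.py --output_path mmmu_eval_output.json --model_path your/model/path
python mmmu_only_eval.py --output_path mmmu_eval_output.json --answer_path ./answer_dict_val.json
```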
Accuracy scores on the MathVista testmini subset (%):
Model | ALL | FQA | GPS | MWP | TQA | VQA |
---|---|---|---|---|---|---|
miniGPT4-7B | 23.1 | 18.6 | 26.0 | 13.4 | 30.4 | 30.2 |
InstructBLIP-7B | 25.3 | 23.1 | 20.7 | 18.3 | 32.3 | 35.2 |
LLaVA-13B | 26.1 | 26.8 | 29.3 | 16.1 | 32.3 | 26.3 |
SPHINX-V1-13B | 27.5 | 23.4 | 23.1 | 21.5 | 39.9 | 34.1 |
LLaVA-1.5-13B | 27.6 | - | - | - | - | - |
OmniLMM-12B | 34.9 | 45.0 | 17.8 | 26.9 | 44.9 | 39.1 |
Math-LLaVA-13B | 46.6 | 37.2 | 57.7 | 56.5 | 51.3 | 33.5 |
Accuracy scores on the MMMU validation set (%):
Model | ALL |
---|---|
miniGPT4-7B | 26.8 |
mPLUG-Owl-7B | 32.7 |
InstructBLIP-7B | 32.9 |
SPHINX-13B | 32.9 |
LLaVA-1.5-13B | 36.4 |
Math-LLaVA-13B | 38.3 |
We also evaluate on MATH-V, a more challenging dataset (accuracy, %):
Model | ALL |
---|---|
Qwen-VL-Plus | 10.72 |
LLaVA-1.5-13B | 11.12 |
ShareGPT4V-13B | 11.88 |
InternLM-XComposer2-VL | 14.54 |
Math-LLaVA-13B | 15.69 |
This project is built on top of the amazing LLaVA repository, as well as the MathVista and MMMU benchmarks. Thanks for their contributions!
If you find our code and dataset helpful to your research, please consider citing us with this BibTeX:
@misc{shihu2024mathllava,
title={Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models},
author={Wenhao Shi and Zhiqiang Hu and Yi Bin and Junhua Liu and Yang Yang and See-Kiong Ng and Lidong Bing and Roy Ka-Wei Lee},
year={2024},
eprint={2406.17294},
archivePrefix={arXiv},
primaryClass={cs.CL}
}