🔥 [2024-12-09]: Code released!
We introduce TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA), executing intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, then integrates both the thoughts and the action outputs to produce coherent responses. Our TACO models outperform the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning, and spatial reasoning.
Figure 1. TACO vs. other multi-modal models
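For intuition, below is a minimal sketch of a CoTA-style inference loop. It is illustrative only: the real agent lives in `taco.run_multimodal_agent`, and the `call_model` callable, tool registry, and per-step schema here are hypothetical stand-ins, not the repo's API.

```python
# Illustrative CoTA loop: the model alternates thoughts and tool actions
# until it emits a Terminate action carrying the final answer.
def cota_answer(call_model, tools, question, image, max_reply=10):
    history = [{"role": "user", "content": question, "image": image}]
    for _ in range(max_reply):
        # Each step is assumed to be {"thought": str, "action": str, "args": dict}.
        step = call_model(history)
        history.append({"role": "assistant", "content": step})
        if step["action"] == "Terminate":
            # The final answer integrates the accumulated thoughts and observations.
            return step["args"]["answer"]
        # Execute the chosen tool, e.g. OCR, depth estimation, or a calculator.
        observation = tools[step["action"]](**step["args"])
        history.append({"role": "tool", "content": observation})
    return None  # step budget (--max-reply) exhausted without a final answer
```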
You can easily download the repo and set up the environment via:
```bash
git clone https://github.com/airesearch-emu/taco.git
cd taco
pip install -r requirements.txt
```
Note that this `requirements.txt` is mainly for running inference and evaluation with TACO. For training TACO, see the Training section for additional requirements.
Run the Python command below:
```bash
python -m taco.run_multimodal_agent --execute --max-reply $max_reply --exp-id $exp_id --model $model --dataset $dataset --infer-device-id "cuda:${device_id}" --prompt-format $format
```
For example:
```bash
python -m taco.run_multimodal_agent --execute --max-reply 10 --exp-id test --model gpt-4o-2024-08-06 --dataset MMVet --infer-device-id cuda:0 --prompt-format cota
```
Run the Python command below:
```bash
python VLMEvalKit/run_eval_on_preds.py --data $dataset --result-file $prediction_file
```
For example:
```bash
python VLMEvalKit/run_eval_on_preds.py --data MMVet --result-file prediction/gpt-4o-2024-08-06/mmvet/test-cota-max-reply-10-seed-42.jsonl
```
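Prediction files are written as JSONL, as the `--result-file` path above suggests. A quick way to eyeball a few records before or after scoring, assuming only that each line parses as a JSON object:

```python
import json

# Peek at the first few records of a prediction file. Assumes only that
# the file is JSONL (one JSON object per line); the field names depend
# on the inference step and are not assumed here.
path = "prediction/gpt-4o-2024-08-06/mmvet/test-cota-max-reply-10-seed-42.jsonl"
with open(path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record.keys()))  # inspect the available fields
        if i >= 2:
            break
```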
Run the bash script below:
```bash
bash scripts/infer_and_eval_taco.sh $model $prompt_format $cuda_id $dataset1 $dataset2 ...
```
For example:
```bash
bash scripts/infer_and_eval_taco.sh gpt-4o-2024-08-06 cota 0 MMVP MMVet RealWorldQA MMStar
```
Note that we recommend setting up separate environments for training Mantis and LLaVA-OneVision models, as they share some common packages but require different versions.
To train TACO on Mantis models:
- Create a new environment and install the required packages:
  ```bash
  pip install -r requirements_mantis.txt
  ```
- Prepare training data
  - a. Download the Mantis training data to `train_data/`
  - b. Download the images to `image_dir`
  - c. Update `path` and `image_dir` in the data config file (see the sanity-check sketch after this list)
- Make necessary changes (e.g. conda environment and `.bashrc`) in the script `scripts/train_taco_mantis_llava.sh`
- Run the bash script below:
  ```bash
  bash scripts/train_taco_mantis_llava.sh $model_name $data_config_file
  ```
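The data config referenced in step c is expected to carry `path` and `image_dir` entries. Here is a minimal sanity-check sketch, assuming the config is a YAML file with those two keys; the filename and exact schema below are guesses, not confirmed by the repo:

```python
import os
import yaml  # pip install pyyaml

# Sanity-check a Mantis data config before launching training.
# Assumes a YAML file with `path` (annotations) and `image_dir` keys,
# per the steps above; the real config schema may differ.
with open("data_config.yaml") as f:  # hypothetical filename
    cfg = yaml.safe_load(f)

for key in ("path", "image_dir"):
    assert key in cfg, f"missing `{key}` in data config"
    assert os.path.exists(cfg[key]), f"`{key}` -> {cfg[key]} does not exist"
print("Data config looks OK.")
```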
To train TACO on LLaVA-OneVision models:
- Create a new environment and install the required packages:
  ```bash
  pip install -r requirements_llava.txt
  ```
- Prepare training data
  - a. Download the training data to `train_data/`
- Make necessary changes (e.g. conda environment and `.bashrc`) in the script `scripts/train_taco_llava_onevision.sh`
- Run the bash script below:
  ```bash
  bash scripts/train_taco_llava_onevision.sh $model_name $data_json_file
  ```
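`$data_json_file` here presumably points at a LLaVA-style JSON list of conversation records. Below is a quick structural check, assuming the common `image` + `conversations` schema; this is an assumption about the format, not a documented contract of this repo:

```python
import json

# Spot-check the training data JSON. Assumes the common LLaVA-style
# schema: a list of records, each with an "image" path and a
# "conversations" list; the actual schema for TACO may differ.
with open("train_data/train.json") as f:  # hypothetical path
    records = json.load(f)

for record in records[:5]:
    assert "conversations" in record, "record missing `conversations`"
    print(record.get("image", "<no image>"), len(record["conversations"]), "turns")
```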
This release is for research purposes only in support of an academic paper. This repository is licensed under the noncommercial license CC-BY-NC 4.0. Some of our TACO models were built with Meta Llama 3, which is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
Please cite us if you find our repository helpful. Thank you!
```bibtex
@misc{ma2024tacolearningmultimodalaction,
  title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action},
  author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
  year={2024},
  eprint={2412.05479},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.05479},
}
```