
🌮 TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

If you like our project or are interested in its updates, please star us on GitHub :) Thank you! ⭐

News

🔥 [2024-12-09]: Code released!

What is TACO?

We introduce TACO, a family of multi-modal large action models designed to improve performance on complex, multi-step, multi-modal tasks. During inference, TACO produces chains-of-thought-and-action (CoTA): it executes intermediate steps by invoking external tools such as OCR, depth estimation and a calculator, then integrates both the thoughts and the action outputs to produce coherent responses. Our TACO models outperform the instruction-tuned baseline across 8 benchmarks, achieving a 3.6% improvement on average, with gains of up to 15% on MMVet tasks involving OCR, mathematical reasoning and spatial reasoning.

Figure 1. TACO vs. other multi-modal models

Code usage

Installation

You can clone the repo and set up the environment via:

git clone https://github.com/airesearch-emu/taco.git
cd taco

pip install -r requirements.txt

Note that this requirements.txt is mainly for running inference and eval with TACO. For training TACO, see the Training section below for additional requirements.

Inference and Eval

Inference only

Run the Python command below:

python -m taco.run_multimodal_agent --execute --max-reply $max_reply --exp-id $exp_id --model $model --dataset $dataset --infer-device-id "cuda:${device_id}" --prompt-format $format

For example,

python -m taco.run_multimodal_agent --execute --max-reply 10 --exp-id test --model gpt-4o-2024-08-06 --dataset MMVet --infer-device-id cuda:0 --prompt-format cota

Evaluation only

Run the Python command below:

python VLMEvalKit/run_eval_on_preds.py --data $dataset --result-file $prediction_file 

For example,

python VLMEvalKit/run_eval_on_preds.py --data MMVet --result-file prediction/gpt-4o-2024-08-06/mmvet/test-cota-max-reply-10-seed-42.jsonl

Infer and eval

Run the bash script below:

bash scripts/infer_and_eval_taco.sh $model $prompt_format $cuda_id $dataset1 $dataset2 ...

For example,

bash scripts/infer_and_eval_taco.sh gpt-4o-2024-08-06 cota 0 MMVP MMVet RealWorldQA MMStar 

Training

Note that we recommend setting up separate environments for training the Mantis and LLaVA-OneVision models, as the two training setups share some common packages but require different versions.
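
For instance, one way to keep the two setups isolated is to create a dedicated conda environment per model family. A minimal sketch (the environment names and Python version are illustrative assumptions, not prescribed by the repo):

conda create -n taco-mantis python=3.10 -y   # hypothetical env name and Python version
conda create -n taco-llava python=3.10 -y    # hypothetical env name and Python version
conda activate taco-mantis                   # activate the matching env before installing its requirements below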

Mantis

  1. Create a new environment and install the required packages:
pip install -r requirements_mantis.txt
  2. Prepare the training data:
    • a. Download the Mantis training data to train_data/
    • b. Download the images to image_dir
    • c. Update path and image_dir in the data config file
  3. Make necessary changes (e.g. conda environment and .bashrc) in the script scripts/train_taco_mantis_llava.sh
  4. Run the bash script below (an illustrative invocation follows this list):
bash scripts/train_taco_mantis_llava.sh $model_name $data_config_file
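
For example (both arguments below are hypothetical placeholders — substitute the model and data config you prepared in the steps above),

bash scripts/train_taco_mantis_llava.sh mantis-8b-siglip-llama3 data_configs/taco_cota.yaml  # hypothetical arguments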

LLaVA-OneVision

  1. Create a new environment and install the required packages:
pip install -r requirements_llava.txt
  2. Prepare the training data:
    • a. Download the training data to train_data/
  3. Make necessary changes (e.g. conda environment and .bashrc) in the script scripts/train_taco_llava_onevision.sh
  4. Run the bash script below (an illustrative invocation follows this list):
bash scripts/train_taco_llava_onevision.sh $model_name $data_json_file
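
For example (both arguments below are hypothetical placeholders — substitute your own model name and data JSON file),

bash scripts/train_taco_llava_onevision.sh llava-onevision-qwen2-7b-ov train_data/taco_cota.json  # hypothetical arguments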

Notice

This release is for research purposes only in support of an academic paper. This repository is licensed under the noncommercial license CC-BY-NC 4.0. Some of our TACO models were built with Meta Llama 3, which is licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.

Citation

Please cite us if you find our repository helpful. Thank you!

@misc{ma2024tacolearningmultimodalaction,
      title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action}, 
      author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
      year={2024},
      eprint={2412.05479},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05479}, 
}