🔥 [2024-12-10]: Data released!
We generate Chains-of-Thought-and-Action (CoTA) data automatically with two approaches as shown below: model-based generation (top) and programmatic generation (bottom).
Figure 1. CoTA data generation method
In model-based generation, we take existing image and QA pairs as inputs and prompt a large language model (i.e. GPT-4o) to generate either a chain-of-thought-and-action (CoTA) or chain-of-thought (CoT) without actions to answer the questions. Then, we verify that the chains lead to correct final answers and parse successfully; if not, we convert them into the direct answer (Direct) format with groundtruth answers. In programmatic generation, we first annotate images with human labelers and models, and then use the dense annotations to fill in manually written templates and generate QA and the corresponding CoTA with Python programs.
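The verify-or-fallback step in model-based generation can be sketched roughly as follows. This is an illustrative sketch, not the repository's actual code: the helper names, the chain schema, and the `Terminate(...)` final-answer convention are all assumptions.

```python
import re

def finalize_example(question, chain, groundtruth):
    """Keep a generated chain only if it parses and ends in the correct
    answer; otherwise fall back to the direct-answer (Direct) format.
    All names and formats here are illustrative assumptions."""
    # Assume a chain is a list of {"thought": ..., "action": ...} steps
    # whose last action carries the final answer as Terminate(<answer>).
    def parse_final_answer(chain):
        if not isinstance(chain, list) or not chain:
            return None
        last = chain[-1]
        action = last.get("action", "") if isinstance(last, dict) else ""
        match = re.search(r"Terminate\((.*)\)", action)
        return match.group(1).strip() if match else None

    answer = parse_final_answer(chain)
    if answer is not None and answer.lower() == str(groundtruth).lower():
        # Chain parses and is correct: keep it as a CoTA example.
        return {"question": question, "format": "CoTA", "chain": chain}
    # Verification failed: emit the groundtruth in Direct format instead.
    return {"question": question, "format": "Direct", "answer": groundtruth}
```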
You can easily download the repo and set up the environment via:

```bash
git clone https://github.com/airesearch-emu/cota.git
cd cota
pip install -r requirements.txt
```
- Step 1: Modify the environment and code paths in the script `scripts/generate_mm_trajs.sh`.
- Step 2: Run the script with `generate_mm_trajs.sh $subset`, where `$subset` is a string representing a subset of the Cauldron dataset, or `mantis-$subset` for Mantis-Instruct. For example, for Cauldron: `generate_mm_trajs.sh ai2d`; for Mantis: `generate_mm_trajs.sh mantis-contrastive_caption`.
- Generate CoTA for single-image examples: `python cota/gen_tool_single.py`
- Generate CoTA for multi-image examples: `python cota/gen_tool_multi.py`
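The template-filling idea behind programmatic generation can be illustrated with a minimal sketch. The template text, annotation schema, and action names below are assumptions for illustration, not the repository's actual templates.

```python
# Minimal sketch of programmatic CoTA generation: dense image
# annotations fill a manually written question template, and a
# matching chain-of-thought-and-action is emitted alongside the QA.
# (Hypothetical template and schema; not the repository's actual code.)

QUESTION_TEMPLATE = "How many {label}s are in the image?"

def generate_counting_examples(annotations):
    """annotations: list of {"label": str, "bbox": [x1, y1, x2, y2]}."""
    counts = {}
    for ann in annotations:
        counts[ann["label"]] = counts.get(ann["label"], 0) + 1
    examples = []
    for label, count in sorted(counts.items()):
        question = QUESTION_TEMPLATE.format(label=label)
        # The paired CoTA walks through the detections before answering.
        chain = [
            {"thought": f"I should locate every {label} in the image.",
             "action": f"LocalizeObjects({label})"},
            {"thought": f"I found {count} {label}(s), so the answer is {count}.",
             "action": f"Terminate({count})"},
        ]
        examples.append({"question": question, "answer": count, "cota": chain})
    return examples
```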
The CoTA datasets are licensed under the noncommercial license CC-BY-NC 4.0. Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This release is for research purposes only, in support of an academic paper.
Please cite us if you find our repository helpful. Thank you!
```bibtex
@misc{ma2024tacolearningmultimodalaction,
      title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action},
      author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
      year={2024},
      eprint={2412.05479},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05479},
}
```