🔥 Project Page 📃 Paper 🐦 Twitter 🤗 Model and Data
We introduce VISUAL EMBEDDED INSTRUCTION (VIM), a new framework designed to evaluate the visual instruction-following capability of Multimodal Large Language Models (MLLMs). VIM challenges MLLMs by embedding instructions into the visual scene, demanding strong visual interpretation skills for instruction following. Please check out our paper "VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following".
Probing results of five MLLMs for visual instruction following under our VIM probing paradigm on four benchmarks (VQAv2, MME, MM-Vet, and the RefCOCO series), across three in-context learning settings (ZS: zero shot, OS: one shot, PS: pair shot).
Zero-shot evaluation paradigm comparison for MLLMs. (a) Left: the image and the text instruction are fed into the MLLM as two separate modalities for inference. (b) Right: VIM takes only the image modality, with the text instruction embedded in the image; no additional text prompt is required. The example above is from MM-Vet (question #86). The figure legend marks image-modality and text-modality inputs.
Please follow the install page to set up the environments and models.
You can load our v-mllm-7b and v-mllm-13b with the LLaVA codebase, since they share the same architecture (a loading sketch follows the config snippets below). You can also use VLMEvalKit to load our models by adding two lines to the llava_series entry in vlmeval/config.py:
'v-mllm_7b': partial(LLaVA, model_pth='VIM-Bench/v-mllm-7b'),
'v-mllm_13b': partial(LLaVA, model_pth='VIM-Bench/v-mllm-13b'),
Then update model_name in the vlmeval/vlm/llava/llava.py template:
if model_pth in ['VIM-Bench/v-mllm-7b']:
model_name = 'llava-v1.5-7b'
elif model_pth in ['VIM-Bench/v-mllm-13b']:
model_name = 'llava-v1.5-13b'
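For reference, here is a minimal loading sketch using the LLaVA codebase. It assumes LLaVA v1.5's load_pretrained_model builder and passes model_name explicitly (matching the override above), since the checkpoint folder name does not contain "llava"; treat it as an illustration rather than the canonical loading path.

```python
# Minimal sketch: load v-mllm-7b through the LLaVA codebase (assumes LLaVA v1.5 is installed).
from llava.model.builder import load_pretrained_model

model_path = "VIM-Bench/v-mllm-7b"  # same architecture as llava-v1.5-7b

# The builder returns the tokenizer, model, image processor, and context length.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name="llava-v1.5-7b",  # matches the model_name override shown above
)
```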
We also ship an already-updated copy of VLMEvalKit in our repo under vlmeval. After you install the environments from LLaVA and VLMEvalKit, you can directly run the following on one example:
python test_vmllm.py
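As a quick sanity check, the sketch below shows one way to query the registered model through VLMEvalKit's Python API. The supported_VLM registry and the message-based generate() interface are assumptions about the VLMEvalKit version you installed, and path/to/vim_example.jpg is a placeholder; adapt it to your setup.

```python
# Minimal sketch (assumes the vlmeval/config.py entries above are in place).
from vlmeval.config import supported_VLM

model = supported_VLM['v-mllm_7b']()  # instantiate the registered model

# VIM embeds the instruction in the image, so the text part can stay minimal.
message = [
    dict(type='image', value='path/to/vim_example.jpg'),  # placeholder path
    dict(type='text', value='Follow the instruction embedded in the image.'),
]
print(model.generate(message))
```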
First, preprocess the dataset file accordingly. Then apply VIM to the source dataset (e.g., mme) under the zero-, one-, or pair-shot setting (an illustrative sketch of the embedding step follows the commands below).
Zero-shot:
bash scripts/convert_probe_bench.sh zs mme
One-shot:
bash scripts/convert_probe_bench.sh os mme
Pair-shot:
bash scripts/convert_probe_bench.sh ps mme
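For intuition only, here is an illustrative sketch (not the actual logic of convert_probe_bench.sh) of how an instruction can be embedded into the image pixels with PIL: the instruction is rendered onto a white strip appended below the original image. The function name and file paths are hypothetical.

```python
# Illustrative sketch only: render an instruction into the image pixels with PIL.
from PIL import Image, ImageDraw, ImageFont

def embed_instruction(image_path: str, instruction: str, out_path: str) -> None:
    image = Image.open(image_path).convert("RGB")
    strip_height = 60  # blank strip to hold the rendered instruction
    canvas = Image.new("RGB", (image.width, image.height + strip_height), "white")
    canvas.paste(image, (0, 0))

    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()  # swap in a larger TrueType font for legibility
    draw.text((10, image.height + 10), instruction, fill="black", font=font)
    canvas.save(out_path)

# Hypothetical usage with placeholder paths.
embed_instruction("images/0001.jpg", "Is there a cat in the image? Answer yes or no.", "vim_images/0001.jpg")
```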
Three in-context evaluation settings: (a) Left: zero shot, only one question is to be answered; (b) Middle: one shot, the image is composed of one image-instruction-answer triplet as a reference, and the answer to a second image-instruction query is required; (c) Right: pair shot, the image is composed of two image-instruction pairs, and answers to both are required.
Main quantitative results on each benchmark, including the subset and the full set, for the three settings.
Left: exploration setup for instruction location in zero-shot evaluation on MM-Vet. Right: exploration setup for the text prompt in zero-shot evaluation on MM-Vet. * denotes results reported in the original paper.
Our results highlight a promising direction for enhancing the instruction-following capability of MLLMs. We hope VIM serves as a useful standard for advancing the state of the art and driving further progress in the field.
If you find this repository useful, please consider citing our papers:
@misc{li2024text,
title={Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?},
author={Xiujun Li and Yujie Lu and Zhe Gan and Jianfeng Gao and William Yang Wang and Yejin Choi},
year={2024},
eprint={2311.17647},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{lu2023vim,
title={VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following},
author={Yujie Lu and Xiujun Li and William Yang Wang and Yejin Choi},
year={2023},
eprint={2311.17647},
archivePrefix={arXiv},
primaryClass={cs.CV}
}