Awesome-Multimodal-Large-Language-Models

Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper

The first survey for Multimodal Large Language Models (MLLMs). ✨

Add WeChat ID (wmd_ustc) to join our MLLM communication group! 🌟

🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper

The first comprehensive evaluation benchmark for MLLMs. Now the leaderboards include 44 advanced models, such as Gemini Pro and GPT-4V. ✨

If you want to add your model to our leaderboards, please feel free to email [email protected]. We will update the leaderboards promptly. ✨

Download MME 🌟🌟

The benchmark dataset was collected by Xiamen University for academic research only. To obtain the dataset, email [email protected] in accordance with the following requirement.

Requirement: A real-name system is encouraged for better academic communication. Your email suffix needs to match your affiliation, such as [email protected] and Xiamen University. Otherwise, you need to explain why. Please include the information below when sending your application email.

Name: (tell us who you are.)
Affiliation: (the name/url of your university or company)
Job Title: (e.g., professor, PhD student, or researcher)
Email: (your email address)
How to use: (only for non-commercial use)

🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | Online Demo | Source Code

The first work to correct hallucinations in MLLMs. ✨

🔥🔥🔥 A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

🍎 [Read our arXiv Paper]

The first technical report for Gemini vs GPT-4V. A total of 128 pages. ✨

Completed within one week of the Gemini API opening. 🌟

📑 If you find our projects helpful to your research, please consider citing:

@article{yin2023survey,
  title={A Survey on Multimodal Large Language Models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
  journal={arXiv preprint arXiv:2306.13549},
  year={2023}
}

@article{fu2023mme,
  title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
  author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and Wu, Yunsheng and Ji, Rongrong},
  journal={arXiv preprint arXiv:2306.13394},
  year={2023}
}

@article{yin2023woodpecker,
  title={Woodpecker: Hallucination Correction for Multimodal Large Language Models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Xu, Tong and Wang, Hao and Sui, Dianbo and Shen, Yunhang and Li, Ke and Sun, Xing and Chen, Enhong},
  journal={arXiv preprint arXiv:2310.16045},
  year={2023}
}

@article{fu2023gemini,
  title={A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise},
  author={Fu, Chaoyou and Zhang, Renrui and Wang, Zihan and Huang, Yubo and Zhang, Zhengye and Qiu, Longtian and Ye, Gaoxiang and Shen, Yunhang and Zhang, Mengdan and Chen, Peixian and Zhao, Sirui and Lin, Shaohui and Jiang, Deqiang and Yin, Di and Gao, Peng and Li, Ke and Li, Hongsheng and Sun, Xing},
  journal={arXiv preprint arXiv:2312.12436},
  year={2023}
}

Table of Contents

Awesome Papers
Awesome Datasets

Awesome Papers

Multimodal Instruction Tuning

Title	Venue	Date	Code	Demo
MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices	arXiv	2023-12-28	Github	-
Osprey: Pixel Understanding with Visual Instruction Tuning	arXiv	2023-12-15	Github	Demo
CogAgent: A Visual Language Model for GUI Agents	arXiv	2023-12-14	Github	Coming soon
Pixel Aligned Language Models	arXiv	2023-12-14	Coming soon	-
See, Say, and Segment: Teaching LMMs to Overcome False Premises	arXiv	2023-12-13	Coming soon	-
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models	arXiv	2023-12-11	Github	Demo
Honeybee: Locality-enhanced Projector for Multimodal LLM	arXiv	2023-12-11	Github	-
Gemini: A Family of Highly Capable Multimodal Models	Google	2023-12-06	-	-
OneLLM: One Framework to Align All Modalities with Language	arXiv	2023-12-06	Github	Demo
Lenna: Language Enhanced Reasoning Detection Assistant	arXiv	2023-12-05	Github	-
Dolphins: Multimodal Language Model for Driving	arXiv	2023-12-01	Github	-
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning	arXiv	2023-11-30	Github	Coming soon
VTimeLLM: Empower LLM to Grasp Video Moments	arXiv	2023-11-30	Github	Local Demo
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	arXiv	2023-11-28	Github	Coming soon
LLMGA: Multimodal Large Language Model based Generation Assistant	arXiv	2023-11-27	Github	Demo
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions	arXiv	2023-11-21	Github	Demo
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge	arXiv	2023-11-20	Github	-
An Embodied Generalist Agent in 3D World	arXiv	2023-11-18	Github	Demo
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning	arXiv	2023-11-13	Github	-
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models	arXiv	2023-11-13	Github	Demo
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models	arXiv	2023-11-11	Github	Demo
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents	arXiv	2023-11-09	Coming soon	Demo
NExT-Chat: An LMM for Chat, Detection and Segmentation	arXiv	2023-11-08	Github	Local Demo
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration	arXiv	2023-11-07	Github	Demo
OtterHD: A High-Resolution Multi-modality Model	arXiv	2023-11-07	Github	-
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding	arXiv	2023-11-06	Coming soon	-
GLaMM: Pixel Grounding Large Multimodal Model	arXiv	2023-11-06	Github	Demo
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning	arXiv	2023-11-02	Github	-
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning	arXiv	2023-10-14	Github	Local Demo
Ferret: Refer and Ground Anything Anywhere at Any Granularity	arXiv	2023-10-11	Github	-
CogVLM: Visual Expert For Large Language Models	arXiv	2023-10-09	Github	Demo
Improved Baselines with Visual Instruction Tuning	arXiv	2023-10-05	Github	Demo
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs	arXiv	2023-10-01	Github	-
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants	arXiv	2023-10-01	Github	Local Demo
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model	arXiv	2023-09-27	-	-
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition	arXiv	2023-09-26	Github	Local Demo
DreamLLM: Synergistic Multimodal Comprehension and Creation	arXiv	2023-09-20	Github	Coming soon
An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models	arXiv	2023-09-18	Coming soon	-
TextBind: Multi-turn Interleaved Multimodal Instruction-following	arXiv	2023-09-14	Github	Demo
NExT-GPT: Any-to-Any Multimodal LLM	arXiv	2023-09-11	Github	Demo
Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics	arXiv	2023-09-13	Github	-
ImageBind-LLM: Multi-modality Instruction Tuning	arXiv	2023-09-07	Github	Demo
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning	arXiv	2023-09-05	-	-
PointLLM: Empowering Large Language Models to Understand Point Clouds	arXiv	2023-08-31	Github	Demo
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models	arXiv	2023-08-31	Github	Local Demo
MLLM-DataEngine: An Iterative Refinement Approach for MLLM	arXiv	2023-08-25	Github	-
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models	arXiv	2023-08-25	Github	Demo
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities	arXiv	2023-08-24	Github	Demo
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages	arXiv	2023-08-23	Github	Demo
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data	arXiv	2023-08-20	Github	-
BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions	arXiv	2023-08-19	Github	Demo
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions	arXiv	2023-08-08	Github	-
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World	arXiv	2023-08-03	Github	Demo
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	arXiv	2023-07-31	Github	Local Demo
3D-LLM: Injecting the 3D World into Large Language Models	arXiv	2023-07-24	Github	-
ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning	arXiv	2023-07-18	-	Demo
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs	arXiv	2023-07-17	Github	Demo
SVIT: Scaling up Visual Instruction Tuning	arXiv	2023-07-09	Github	-
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest	arXiv	2023-07-07	Github	Demo
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?	arXiv	2023-07-05	Github	-
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding	arXiv	2023-07-04	Github	Demo
Visual Instruction Tuning with Polite Flamingo	arXiv	2023-07-03	Github	Demo
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding	arXiv	2023-06-29	Github	Demo
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic	arXiv	2023-06-27	Github	Demo
MotionGPT: Human Motion as a Foreign Language	arXiv	2023-06-26	Github	-
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	arXiv	2023-06-15	Github	Coming soon
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark	arXiv	2023-06-11	Github	Demo
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	arXiv	2023-06-08	Github	Demo
MIMIC-IT: Multi-Modal In-Context Instruction Tuning	arXiv	2023-06-08	Github	Demo
M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	arXiv	2023-06-07	-	-
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	arXiv	2023-06-05	Github	Demo
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day	arXiv	2023-06-01	Github	-
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction	arXiv	2023-05-30	Github	Demo
PandaGPT: One Model To Instruction-Follow Them All	arXiv	2023-05-25	Github	Demo
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst	arXiv	2023-05-25	Github	-
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models	arXiv	2023-05-24	Github	Local Demo
DetGPT: Detect What You Need via Reasoning	arXiv	2023-05-23	Github	Demo
Pengi: An Audio Language Model for Audio Tasks	arXiv	2023-05-19	Github	-
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks	arXiv	2023-05-18	Github	-
Listen, Think, and Understand	arXiv	2023-05-18	Github	Demo
VisualGLM-6B	-	2023-05-17	Github	Local Demo
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering	arXiv	2023-05-17	Github	-
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	arXiv	2023-05-11	Github	Local Demo
VideoChat: Chat-Centric Video Understanding	arXiv	2023-05-10	Github	Demo
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans	arXiv	2023-05-08	Github	Demo
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages	arXiv	2023-05-07	Github	-
LMEye: An Interactive Perception Network for Large Language Models	arXiv	2023-05-05	Github	Local Demo
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	arXiv	2023-04-28	Github	Demo
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	arXiv	2023-04-27	Github	Demo
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	arXiv	2023-04-20	Github	-
Visual Instruction Tuning	NeurIPS	2023-04-17	Github	Demo
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention	arXiv	2023-03-28	Github	Demo
MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning	ACL	2022-12-21	Github	-

Multimodal In-Context Learning

Title	Venue	Date	Code	Demo
Hijacking Context in Large Multi-modal Models	arXiv	2023-12-07	-	-
Towards More Unified In-context Visual Understanding	arXiv	2023-12-05	-	-
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning	arXiv	2023-09-14	Github	Demo
Link-Context Learning for Multimodal LLMs	arXiv	2023-08-15	Github	Demo
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models	arXiv	2023-08-02	Github	Demo
Med-Flamingo: a Multimodal Medical Few-shot Learner	arXiv	2023-07-27	Github	Local Demo
Generative Pretraining in Multimodality	arXiv	2023-07-11	Github	Demo
AVIS: Autonomous Visual Information Seeking with Large Language Models	arXiv	2023-06-13	-	-
MIMIC-IT: Multi-Modal In-Context Instruction Tuning	arXiv	2023-06-08	Github	Demo
Exploring Diverse In-Context Configurations for Image Captioning	NeurIPS	2023-05-24	Github	-
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models	arXiv	2023-04-19	Github	Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace	arXiv	2023-03-30	Github	Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action	arXiv	2023-03-20	Github	Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction	ICCV	2023-03-09	Github	-
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering	CVPR	2023-03-03	Github	-
Visual Programming: Compositional visual reasoning without training	CVPR	2022-11-18	Github	Local Demo
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA	AAAI	2022-06-28	Github	-
Flamingo: a Visual Language Model for Few-Shot Learning	NeurIPS	2022-04-29	Github	Demo
Multimodal Few-Shot Learning with Frozen Language Models	NeurIPS	2021-06-25	-	-

Multimodal Chain-of-Thought

Title	Venue	Date	Code	Demo
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models	NeurIPS	2023-10-25	Github	-
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic	arXiv	2023-06-27	Github	Demo
Explainable Multimodal Emotion Reasoning	arXiv	2023-06-27	Github	-
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought	arXiv	2023-05-24	Github	-
Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction	arXiv	2023-05-23	-	-
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering	arXiv	2023-05-05	-	-
Caption Anything: Interactive Image Description with Diverse Multimodal Controls	arXiv	2023-05-04	Github	Demo
Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings	arXiv	2023-05-03	Coming soon	-
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models	arXiv	2023-04-19	Github	Demo
Chain of Thought Prompt Tuning in Vision Language Models	arXiv	2023-04-16	Coming soon	-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action	arXiv	2023-03-20	Github	Demo
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models	arXiv	2023-03-08	Github	Demo
Multimodal Chain-of-Thought Reasoning in Language Models	arXiv	2023-02-02	Github	-
Visual Programming: Compositional visual reasoning without training	CVPR	2022-11-18	Github	Local Demo
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering	NeurIPS	2022-09-20	Github	-

LLM-Aided Visual Reasoning

Title	Venue	Date	Code	Demo
V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs	arXiv	2023-12-21	Github	Local Demo
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing	arXiv	2023-11-01	Github	Demo
MM-VID: Advancing Video Understanding with GPT-4V(ision)	arXiv	2023-10-30	-	-
ControlLLM: Augment Language Models with Tools by Searching on Graphs	arXiv	2023-10-26	Github	-
Woodpecker: Hallucination Correction for Multimodal Large Language Models	arXiv	2023-10-24	Github	Demo
MindAgent: Emergent Gaming Interaction	arXiv	2023-09-18	Github	-
LISA: Reasoning Segmentation via Large Language Model	arXiv	2023-08-01	Github	Demo
Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language	arXiv	2023-06-28	Github	Demo
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models	arXiv	2023-06-15	-	-
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn	arXiv	2023-06-14	Github	-
AVIS: Autonomous Visual Information Seeking with Large Language Models	arXiv	2023-06-13	-	-
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction	arXiv	2023-05-30	Github	Demo
Mindstorms in Natural Language-Based Societies of Mind	arXiv	2023-05-26	-	-
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models	arXiv	2023-05-24	Github	-
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models	arXiv	2023-05-24	Github	Local Demo
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation	arXiv	2023-05-10	Github	-
Caption Anything: Interactive Image Description with Diverse Multimodal Controls	arXiv	2023-05-04	Github	Demo
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models	arXiv	2023-04-19	Github	Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace	arXiv	2023-03-30	Github	Demo
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action	arXiv	2023-03-20	Github	Demo
ViperGPT: Visual Inference via Python Execution for Reasoning	arXiv	2023-03-14	Github	Local Demo
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions	arXiv	2023-03-12	Github	Local Demo
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction	ICCV	2023-03-09	-	-
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models	arXiv	2023-03-08	Github	Demo
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners	CVPR	2023-03-03	Github	-
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models	arXiv	2022-11-28	Github	-
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning	CVPR	2022-11-21	Github	-
Visual Programming: Compositional visual reasoning without training	CVPR	2022-11-18	Github	Local Demo
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language	arXiv	2022-04-01	Github	-

Foundation Models

Title	Venue	Date	Code	Demo
Gemini: A Family of Highly Capable Multimodal Models	Google	2023-12-06	-	-
Fuyu-8B: A Multimodal Architecture for AI Agents	blog	2023-10-17	Huggingface	Demo
Unified Model for Image, Video, Audio and Language Tasks	arXiv	2023-07-30	Github	Demo
PaLI-3 Vision Language Models: Smaller, Faster, Stronger	arXiv	2023-10-13	-	-
GPT-4V(ision) System Card	OpenAI	2023-09-25	-	-
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization	arXiv	2023-09-09	Github	-
Multimodal Foundation Models: From Specialists to General-Purpose Assistants	arXiv	2023-09-18	-	-
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training	NeurIPS	2023-07-13	Github	-
Generative Pretraining in Multimodality	arXiv	2023-07-11	Github	Demo
Kosmos-2: Grounding Multimodal Large Language Models to the World	arXiv	2023-06-26	Github	Demo
Transfer Visual Prompt Generator across LLMs	arXiv	2023-05-02	Github	Demo
GPT-4 Technical Report	arXiv	2023-03-15	-	-
PaLM-E: An Embodied Multimodal Language Model	arXiv	2023-03-06	-	Demo
Prismer: A Vision-Language Model with An Ensemble of Experts	arXiv	2023-03-04	Github	Demo
Language Is Not All You Need: Aligning Perception with Language Models	arXiv	2023-02-27	Github	-
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	arXiv	2023-01-30	Github	Demo
VIMA: General Robot Manipulation with Multimodal Prompts	ICML	2022-10-06	Github	Local Demo
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge	NeurIPS	2022-06-17	Github	-
Write and Paint: Generative Vision-Language Models are Unified Modal Learners	ICLR	2022-06-15	Github	-
Language Models are General-Purpose Interfaces	arXiv	2022-06-13	Github	-

Evaluation

Title	Venue	Date	Page
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise	arXiv	2023-12-19	Github
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models	arXiv	2023-12-05	Github
How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs	arXiv	2023-11-27	Github
MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V	arXiv	2023-11-23	Github
VLM-Eval: A General Evaluation on Video Large Language Models	arXiv	2023-11-20	Coming soon
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges	arXiv	2023-11-06	Github
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving	arXiv	2023-11-09	Github
Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead	arXiv	2023-11-05	-
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging	arXiv	2023-10-31	-
An Early Evaluation of GPT-4V(ision)	arXiv	2023-10-25	Github
Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and In-depth Evaluation	arXiv	2023-10-25	Github
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models	arXiv	2023-10-23	Github
MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models	arXiv	2023-10-03	Github
Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations	arXiv	2023-10-02	Github
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning	arXiv	2023-10-01	Github
Can We Edit Multimodal Large Language Models?	arXiv	2023-10-12	Github
REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets	arXiv	2023-10-10	Github
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)	arXiv	2023-09-29	-
TouchStone: Evaluating Vision-Language Models by Language Models	arXiv	2023-08-31	Github
✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models	arXiv	2023-08-31	Github
SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs	arXiv	2023-08-07	Github
Tiny LVLM-eHub: Early Multimodal Experiments with Bard	arXiv	2023-08-07	Github
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities	arXiv	2023-08-04	Github
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	arXiv	2023-07-30	Github
MMBench: Is Your Multi-modal Model an All-around Player?	arXiv	2023-07-12	Github
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models	arXiv	2023-06-23	Github
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models	arXiv	2023-06-15	Github
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark	arXiv	2023-06-11	Github
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models	arXiv	2023-06-08	Github
On The Hidden Mystery of OCR in Large Multimodal Models	arXiv	2023-05-13	Github

Multimodal Hallucination

Title	Venue	Date	Code	Demo
MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations	arXiv	2023-12-06	Github	-
Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites	arXiv	2023-12-04	Github	-
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback	arXiv	2023-12-01	Github	Demo
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation	arXiv	2023-11-29	Github	-
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding	arXiv	2023-11-28	Github	-
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization	arXiv	2023-11-28	Coming soon	-
Mitigating Hallucination in Visual Language Models with Visual Supervision	arXiv	2023-11-27	-	-
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data	arXiv	2023-11-22	Github	-
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models	arXiv	2023-11-02	Github	-
Woodpecker: Hallucination Correction for Multimodal Large Language Models	arXiv	2023-10-24	Github	Demo
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models	arXiv	2023-10-09	-	-
HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption	arXiv	2023-10-03	-	-
Analyzing and Mitigating Object Hallucination in Large Vision-Language Models	arXiv	2023-10-01	Github	-
Aligning Large Multimodal Models with Factually Augmented RLHF	arXiv	2023-09-25	Github	Demo
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models	arXiv	2023-09-07	-	-
CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning	arXiv	2023-09-05	-	-
Evaluation and Analysis of Hallucination in Large Vision-Language Models	arXiv	2023-08-29	-	-
VIGC: Visual Instruction Generation and Correction	arXiv	2023-08-24	Github	Demo
Detecting and Preventing Hallucinations in Large Vision Language Models	arXiv	2023-08-11	-	-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning	arXiv	2023-06-26	Github	Demo
Evaluating Object Hallucination in Large Vision-Language Models	EMNLP	2023-05-17	Github	-

Multimodal RLHF

Title	Venue	Date	Code	Demo
Silkie: Preference Distillation for Large Visual Language Models	arXiv	2023-12-17	Github	-
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback	arXiv	2023-12-01	Github	Demo
Aligning Large Multimodal Models with Factually Augmented RLHF	arXiv	2023-09-25	Github	Demo

Others

Title	Venue	Date	Code	Demo
VCoder: Versatile Vision Encoders for Multimodal Large Language Models	arXiv	2023-12-21	Github	Local Demo
Prompt Highlighter: Interactive Control for Multi-Modal LLMs	arXiv	2023-12-07	Github	-
Planting a SEED of Vision in Large Language Model	arXiv	2023-07-16	Github	-
Can Large Pre-trained Models Help Vision Models on Perception Tasks?	arXiv	2023-06-01	Github	-
Contextual Object Detection with Multimodal Large Language Models	arXiv	2023-05-29	Github	Demo
Generating Images with Multimodal Language Models	arXiv	2023-05-26	Github	-
On Evaluating Adversarial Robustness of Large Vision-Language Models	arXiv	2023-05-26	Github	-
Grounding Language Models to Images for Multimodal Inputs and Outputs	ICML	2023-01-31	Github	Demo

Awesome Datasets

Datasets of Pre-Training for Alignment

Name	Paper	Type	Modalities
ShareGPT4V	ShareGPT4V: Improving Large Multi-Modal Models with Better Captions	Caption	Image-Text
AS-1B	The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World	Hybrid	Image-Text
InternVid	InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	Caption	Video-Text
MS-COCO	Microsoft COCO: Common Objects in Context	Caption	Image-Text
SBU Captions	Im2Text: Describing Images Using 1 Million Captioned Photographs	Caption	Image-Text
Conceptual Captions	Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning	Caption	Image-Text
LAION-400M	LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs	Caption	Image-Text
VG Captions	Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations	Caption	Image-Text
Flickr30k	Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models	Caption	Image-Text
AI-Caps	AI Challenger : A Large-scale Dataset for Going Deeper in Image Understanding	Caption	Image-Text
Wukong Captions	Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark	Caption	Image-Text
GRIT	Kosmos-2: Grounding Multimodal Large Language Models to the World	Caption	Image-Text-Bounding-Box
Youku-mPLUG	Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	Caption	Video-Text
MSR-VTT	MSR-VTT: A Large Video Description Dataset for Bridging Video and Language	Caption	Video-Text
Webvid10M	Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	Caption	Video-Text
WavCaps	WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research	Caption	Audio-Text
AISHELL-1	AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline	ASR	Audio-Text
AISHELL-2	AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale	ASR	Audio-Text
VSDial-CN	X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages	ASR	Image-Audio-Text

Datasets of Multimodal Instruction Tuning

Name	Paper	Link	Notes
M3DBench	M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts	Link	A large-scale 3D instruction tuning dataset
LVIS-Instruct4V	To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning	Link	A visual instruction dataset via self-instruction from GPT-4V
ComVint	What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning	Link	A synthetic instruction dataset for complex visual reasoning
SparklesDialogue	✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models	Link	A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions to augment the conversational competence of instruction-following LLMs across multiple images and dialogue turns.
StableLLaVA	StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data	Link	A cheap and effective approach to collect visual instruction tuning data
M-HalDetect	Detecting and Preventing Hallucinations in Large Vision Language Models	Coming soon	A dataset used to train and benchmark models for hallucination detection and prevention
MGVLID	ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning	-	A high-quality instruction-tuning dataset including image-text and region-text pairs
BuboGPT	BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs	Link	A high-quality instruction-tuning dataset including audio-text audio caption data and audio-image-text localization data
SVIT	SVIT: Scaling up Visual Instruction Tuning	Link	A large-scale dataset with 4.2M informative visual instruction tuning data, including conversations, detailed descriptions, complex reasoning and referring QAs
mPLUG-DocOwl	mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding	Link	An instruction tuning dataset featuring a wide range of visual-text understanding tasks including OCR-free document understanding
PF-1M	Visual Instruction Tuning with Polite Flamingo	Link	A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo.
LLaVAR	LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding	Link	A visual instruction-tuning dataset for Text-rich Image Understanding
MotionGPT	MotionGPT: Human Motion as a Foreign Language	Link	An instruction-tuning dataset including multiple human motion-related tasks
LRV-Instruction	Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning	Link	Visual instruction tuning dataset for addressing hallucination issue
Macaw-LLM	Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	Link	A large-scale multi-modal instruction dataset in terms of multi-turn dialogue
LAMM-Dataset	LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark	Link	A comprehensive multi-modal instruction tuning dataset
Video-ChatGPT	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	Link	A high-quality video instruction dataset with 100K samples
MIMIC-IT	MIMIC-IT: Multi-Modal In-Context Instruction Tuning	Link	Multimodal in-context instruction tuning
M³IT	M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	Link	Large-scale, broad-coverage multimodal instruction tuning dataset
LLaVA-Med	LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day	Coming soon	A large-scale, broad-coverage biomedical instruction-following dataset
GPT4Tools	GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction	Link	Tool-related instruction datasets
MULTIS	ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst	Coming soon	Multimodal instruction tuning dataset covering 16 multimodal tasks
DetGPT	DetGPT: Detect What You Need via Reasoning	Link	Instruction-tuning dataset with 5000 images and around 30000 query-answer pairs
PMC-VQA	PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering	Coming soon	Large-scale medical visual question-answering dataset
VideoChat	VideoChat: Chat-Centric Video Understanding	Link	Video-centric multimodal instruction dataset
X-LLM	X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages	Link	Chinese multimodal instruction dataset
LMEye	LMEye: An Interactive Perception Network for Large Language Models	Link	A multi-modal instruction-tuning dataset
cc-sbu-align	MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	Link	Multimodal aligned dataset for improving the model's usability and generation fluency
LLaVA-Instruct-150K	Visual Instruction Tuning	Link	Multimodal instruction-following data generated by GPT
MultiInstruct	MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning	Link	The first multimodal instruction tuning benchmark dataset

Datasets of In-Context Learning

Name	Paper	Link	Notes
MIC	MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning	Link	A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multiple image inputs, and multimodal in-context learning inputs.
MIMIC-IT	MIMIC-IT: Multi-Modal In-Context Instruction Tuning	Link	Multimodal in-context instruction dataset

Datasets of Multimodal Chain-of-Thought

Name	Paper	Link	Notes
EMER	Explainable Multimodal Emotion Reasoning	Coming soon	A benchmark dataset for the explainable emotion reasoning task
EgoCOT	EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought	Coming soon	Large-scale embodied planning dataset
VIP	Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction	Coming soon	An inference-time dataset that can be used to evaluate VideoCOT
ScienceQA	Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering	Link	Large-scale multi-choice dataset, featuring multimodal science questions and diverse domains

Datasets of Multimodal RLHF

Name	Paper	Link	Notes
VLFeedback	Silkie: Preference Distillation for Large Visual Language Models	Link	A vision-language feedback dataset annotated by AI

Benchmarks for Evaluation

Name	Paper	Link	Notes
M3DBench	M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts	Link	A 3D-centric benchmark
MLLM-Bench	MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V	Link	GPT-4V evaluation with per-sample criteria
BenchLMM	BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models	Link	A benchmark for assessing robustness to different image styles
MMC-Benchmark	MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning	Link	A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts
MVBench	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	Link	A comprehensive multimodal benchmark for video understanding
Bingo	Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges	Link	A benchmark for hallucination evaluation that focuses on two common types
MagnifierBench	OtterHD: A High-Resolution Multi-modality Model	Link	A benchmark designed to probe models' fine-grained perception ability
HallusionBench	HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models	Link	An image-context reasoning benchmark for evaluation of hallucination
MMHal-Bench	Aligning Large Multimodal Models with Factually Augmented RLHF	Link	A benchmark for hallucination evaluation
MathVista	MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models	Link	A benchmark that challenges both visual and math reasoning capabilities
SparklesEval	✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models	Link	A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns based on three distinct criteria.
ISEKAI	Link-Context Learning for Multimodal LLMs	Link	A benchmark consisting exclusively of unseen generated image-label pairs designed for link-context learning
M-HalDetect	Detecting and Preventing Hallucinations in Large Vision Language Models	Coming soon	A dataset used to train and benchmark models for hallucination detection and prevention
I4	Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions	Link	A benchmark to comprehensively evaluate the instruction following ability on complicated interleaved vision-language instructions
SciGraphQA	SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs	Link	A large-scale chart-visual question-answering dataset
MM-Vet	MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities	Link	An evaluation benchmark that examines large multimodal models on complicated multimodal tasks
SEED-Bench	SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	Link	A benchmark for evaluation of generative comprehension in MLLMs
MMBench	MMBench: Is Your Multi-modal Model an All-around Player?	Link	A systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models
Lynx	What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?	Link	A comprehensive evaluation benchmark including both image and video tasks
GAVIE	Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning	Link	A benchmark to evaluate hallucination and instruction-following ability
MME	MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models	Link	A comprehensive MLLM Evaluation benchmark
LVLM-eHub	LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models	Link	An evaluation platform for MLLMs
LAMM-Benchmark	LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark	Link	A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks
M3Exam	M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models	Link	A multilingual, multimodal, multilevel benchmark for evaluating MLLMs
OwlEval	mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	Link	Dataset for evaluation on multiple capabilities

Others

Name	Paper	Link	Notes
IMAD	IMAD: IMage-Augmented multi-modal Dialogue	Link	Multimodal dialogue dataset
Video-ChatGPT	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	Link	A quantitative evaluation framework for video-based dialogue models
CLEVR-ATVC	Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation	Link	A synthetic multimodal fine-tuning dataset for learning to reject instructions
Fruit-ATVC	Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation	Link	A manually photographed multimodal fine-tuning dataset for learning to reject instructions
InfoSeek	Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?	Link	A VQA dataset that focuses on asking information-seeking questions
OVEN	Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities	Link	A dataset that focuses on recognizing Wikipedia visual entities from images in the wild
