✨✨Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation.

Awesome-Multimodal-Large-Language-Models

Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper

The first survey of Multimodal Large Language Models (MLLMs). ✨

Welcome to add WeChat ID (wmd_ustc) to join our MLLM communication group! 🌟


🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Project Page [Leaderboards] | Paper

The first comprehensive evaluation benchmark for MLLMs. Now the leaderboards include 44 advanced models, such as Gemini Pro and GPT-4V. ✨

If you want to add your model to our leaderboards, please feel free to email [email protected]. We will update the leaderboards promptly. ✨

Download MME 🌟🌟

The benchmark dataset is collected by Xiamen University for academic research only. You can email [email protected] to obtain the dataset, subject to the following requirement.

Requirement: A real-name system is encouraged for better academic communication. Your email suffix needs to match your affiliation, e.g., [email protected] for Xiamen University. Otherwise, you need to explain why. Please include the information below when sending your application email.

Name: (tell us who you are.)
Affiliation: (the name/url of your university or company)
Job Title: (e.g., professor, PhD student, or researcher)
Email: (your email address)
How to use: (only for non-commercial use)

🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | Online Demo | Source Code

The first work to correct hallucinations in MLLMs. ✨


🔥🔥🔥 A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

The first technical report for Gemini vs GPT-4V. A total of 128 pages. ✨
Completed within one week of the Gemini API opening. 🌟


📑 If you find our projects helpful to your research, please consider citing:

@article{yin2023survey,
  title={A Survey on Multimodal Large Language Models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
  journal={arXiv preprint arXiv:2306.13549},
  year={2023}
}

@article{fu2023mme,
  title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
  author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and Wu, Yunsheng and Ji, Rongrong},
  journal={arXiv preprint arXiv:2306.13394},
  year={2023}
}

@article{yin2023woodpecker,
  title={Woodpecker: Hallucination Correction for Multimodal Large Language Models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Xu, Tong and Wang, Hao and Sui, Dianbo and Shen, Yunhang and Li, Ke and Sun, Xing and Chen, Enhong},
  journal={arXiv preprint arXiv:2310.16045},
  year={2023}
}

@article{fu2023gemini,
  title={A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise},
  author={Fu, Chaoyou and Zhang, Renrui and Wang, Zihan and Huang, Yubo and Zhang, Zhengye and Qiu, Longtian and Ye, Gaoxiang and Shen, Yunhang and Zhang, Mengdan and Chen, Peixian and Zhao, Sirui and Lin, Shaohui and Jiang, Deqiang and Yin, Di and Gao, Peng and Li, Ke and Li, Hongsheng and Sun, Xing},
  journal={arXiv preprint arXiv:2312.12436},
  year={2023}
}


Awesome Papers

Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | Github | - |
| Osprey: Pixel Understanding with Visual Instruction Tuning | arXiv | 2023-12-15 | Github | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | Github | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | arXiv | 2023-12-11 | Github | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | arXiv | 2023-12-11 | Github | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | Github | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | Github | - |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | Github | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | Github | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | Github | Local Demo |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | Github | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | Github | Demo |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | Github | Demo |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | Github | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | Github | Demo |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | Github | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | Github | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | arXiv | 2023-11-11 | Github | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | Coming soon | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | Github | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | Github | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | Github | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | arXiv | 2023-11-06 | Github | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | Github | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | Github | Local Demo |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | Github | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | Github | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | Github | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | Github | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | Github | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | Github | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | arXiv | 2023-09-20 | Github | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | Github | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | Github | - |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | Github | Demo |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | Github | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | Github | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | Github | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | Github | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | Github | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | arXiv | 2023-08-23 | Github | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | Github | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | Github | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | Github | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | arXiv | 2023-08-03 | Github | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | Github | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | Github | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | Github | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | Github | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | Github | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | Github | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | Github | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | Github | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | Github | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | Github | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | Github | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | Github | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | Github | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | Github | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | Github | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | Github | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | Github | Demo |
| Pengi: An Audio Language Model for Audio Tasks | arXiv | 2023-05-19 | Github | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | Github | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | Github | Demo |
| VisualGLM-6B | - | 2023-05-17 | Github | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | Github | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | Github | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | Github | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | Github | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | Github | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | Github | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | Github | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | Github | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | Github | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | Github | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | arXiv | 2023-03-28 | Github | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | Github | - |

Multimodal In-Context Learning

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| Hijacking Context in Large Multi-modal Models | arXiv | 2023-12-07 | - | - |
| Towards More Unified In-context Visual Understanding | arXiv | 2023-12-05 | - | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | arXiv | 2023-09-14 | Github | Demo |
| Link-Context Learning for Multimodal LLMs | arXiv | 2023-08-15 | Github | Demo |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models | arXiv | 2023-08-02 | Github | Demo |
| Med-Flamingo: a Multimodal Medical Few-shot Learner | arXiv | 2023-07-27 | Github | Local Demo |
| Generative Pretraining in Multimodality | arXiv | 2023-07-11 | Github | Demo |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
| Exploring Diverse In-Context Configurations for Image Captioning | NeurIPS | 2023-05-24 | Github | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | Github | - |
| Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | Github | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | Demo |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |

Multimodal Chain-of-Thought

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | NeurIPS | 2023-10-25 | Github | - |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | Github | Demo |
| Explainable Multimodal Emotion Reasoning | arXiv | 2023-06-27 | Github | - |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | arXiv | 2023-05-24 | Github | - |
| Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | arXiv | 2023-05-23 | - | - |
| T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering | arXiv | 2023-05-05 | - | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | arXiv | 2023-05-04 | Github | Demo |
| Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings | arXiv | 2023-05-03 | Coming soon | - |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| Chain of Thought Prompt Tuning in Vision Language Models | arXiv | 2023-04-16 | Coming soon | - |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | arXiv | 2023-03-08 | Github | Demo |
| Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | NeurIPS | 2022-09-20 | Github | - |

LLM-Aided Visual Reasoning

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs | arXiv | 2023-12-21 | Github | Local Demo |
| LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | arXiv | 2023-11-01 | Github | Demo |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | arXiv | 2023-10-30 | - | - |
| ControlLLM: Augment Language Models with Tools by Searching on Graphs | arXiv | 2023-10-26 | Github | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | Github | Demo |
| MindAgent: Emergent Gaming Interaction | arXiv | 2023-09-18 | Github | - |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | Github | Demo |
| Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language | arXiv | 2023-06-28 | Github | Demo |
| Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models | arXiv | 2023-06-15 | - | - |
| AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | arXiv | 2023-06-14 | Github | - |
| AVIS: Autonomous Visual Information Seeking with Large Language Models | arXiv | 2023-06-13 | - | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | Github | Demo |
| Mindstorms in Natural Language-Based Societies of Mind | arXiv | 2023-05-26 | - | - |
| LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | arXiv | 2023-05-24 | Github | - |
| IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models | arXiv | 2023-05-24 | Github | Local Demo |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | arXiv | 2023-05-10 | Github | - |
| Caption Anything: Interactive Image Description with Diverse Multimodal Controls | arXiv | 2023-05-04 | Github | Demo |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
| ViperGPT: Visual Inference via Python Execution for Reasoning | arXiv | 2023-03-14 | Github | Local Demo |
| ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions | arXiv | 2023-03-12 | Github | Local Demo |
| ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction | ICCV | 2023-03-09 | - | - |
| Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models | arXiv | 2023-03-08 | Github | Demo |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | CVPR | 2023-03-03 | Github | - |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | arXiv | 2022-11-28 | Github | - |
| PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning | CVPR | 2022-11-21 | Github | - |
| Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | arXiv | 2022-04-01 | Github | - |

Foundation Models

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| Fuyu-8B: A Multimodal Architecture for AI Agents | blog | 2023-10-17 | Huggingface | Demo |
| Unified Model for Image, Video, Audio and Language Tasks | arXiv | 2023-07-30 | Github | Demo |
| PaLI-3 Vision Language Models: Smaller, Faster, Stronger | arXiv | 2023-10-13 | - | - |
| GPT-4V(ision) System Card | OpenAI | 2023-09-25 | - | - |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | arXiv | 2023-09-09 | Github | - |
| Multimodal Foundation Models: From Specialists to General-Purpose Assistants | arXiv | 2023-09-18 | - | - |
| Bootstrapping Vision-Language Learning with Decoupled Language Pre-training | NeurIPS | 2023-07-13 | Github | - |
| Generative Pretraining in Multimodality | arXiv | 2023-07-11 | Github | Demo |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | arXiv | 2023-06-26 | Github | Demo |
| Transfer Visual Prompt Generator across LLMs | arXiv | 2023-05-02 | Github | Demo |
| GPT-4 Technical Report | arXiv | 2023-03-15 | - | - |
| PaLM-E: An Embodied Multimodal Language Model | arXiv | 2023-03-06 | - | Demo |
| Prismer: A Vision-Language Model with An Ensemble of Experts | arXiv | 2023-03-04 | Github | Demo |
| Language Is Not All You Need: Aligning Perception with Language Models | arXiv | 2023-02-27 | Github | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | arXiv | 2023-01-30 | Github | Demo |
| VIMA: General Robot Manipulation with Multimodal Prompts | ICML | 2022-10-06 | Github | Local Demo |
| MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge | NeurIPS | 2022-06-17 | Github | - |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | ICLR | 2022-06-15 | Github | - |
| Language Models are General-Purpose Interfaces | arXiv | 2022-06-13 | Github | - |

Evaluation

| Title | Venue | Date | Page |
| :--- | :---: | :---: | :---: |
| A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise | arXiv | 2023-12-19 | Github |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | arXiv | 2023-12-05 | Github |
| How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | arXiv | 2023-11-27 | Github |
| MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | arXiv | 2023-11-23 | Github |
| VLM-Eval: A General Evaluation on Video Large Language Models | arXiv | 2023-11-20 | Coming soon |
| Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | arXiv | 2023-11-06 | Github |
| On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving | arXiv | 2023-11-09 | Github |
| Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead | arXiv | 2023-11-05 | - |
| A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging | arXiv | 2023-10-31 | - |
| An Early Evaluation of GPT-4V(ision) | arXiv | 2023-10-25 | Github |
| Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation | arXiv | 2023-10-25 | Github |
| HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | arXiv | 2023-10-23 | Github |
| MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | arXiv | 2023-10-03 | Github |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | arXiv | 2023-10-02 | Github |
| Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning | arXiv | 2023-10-01 | Github |
| Can We Edit Multimodal Large Language Models? | arXiv | 2023-10-12 | Github |
| REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets | arXiv | 2023-10-10 | Github |
| The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | arXiv | 2023-09-29 | - |
| TouchStone: Evaluating Vision-Language Models by Language Models | arXiv | 2023-08-31 | Github |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | Github |
| SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | arXiv | 2023-08-07 | Github |
| Tiny LVLM-eHub: Early Multimodal Experiments with Bard | arXiv | 2023-08-07 | Github |
| MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | arXiv | 2023-08-04 | Github |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | arXiv | 2023-07-30 | Github |
| MMBench: Is Your Multi-modal Model an All-around Player? | arXiv | 2023-07-12 | Github |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | arXiv | 2023-06-23 | Github |
| LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | arXiv | 2023-06-15 | Github |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | Github |
| M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | arXiv | 2023-06-08 | Github |
| On The Hidden Mystery of OCR in Large Multimodal Models | arXiv | 2023-05-13 | Github |

Multimodal Hallucination

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations | arXiv | 2023-12-06 | Github | - |
| Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites | arXiv | 2023-12-04 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | arXiv | 2023-11-29 | Github | - |
| Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding | arXiv | 2023-11-28 | Github | - |
| Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization | arXiv | 2023-11-28 | Coming soon | - |
| Mitigating Hallucination in Visual Language Models with Visual Supervision | arXiv | 2023-11-27 | - | - |
| HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data | arXiv | 2023-11-22 | Github | - |
| FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models | arXiv | 2023-11-02 | Github | - |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | arXiv | 2023-10-24 | Github | Demo |
| Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models | arXiv | 2023-10-09 | - | - |
| HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption | arXiv | 2023-10-03 | - | - |
| Analyzing and Mitigating Object Hallucination in Large Vision-Language Models | arXiv | 2023-10-01 | Github | - |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |
| Evaluation and Mitigation of Agnosia in Multimodal Large Language Models | arXiv | 2023-09-07 | - | - |
| CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning | arXiv | 2023-09-05 | - | - |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | arXiv | 2023-08-29 | - | - |
| VIGC: Visual Instruction Generation and Correction | arXiv | 2023-08-24 | Github | Demo |
| Detecting and Preventing Hallucinations in Large Vision Language Models | arXiv | 2023-08-11 | - | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | arXiv | 2023-06-26 | Github | Demo |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP | 2023-05-17 | Github | - |

Multimodal RLHF

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| Silkie: Preference Distillation for Large Visual Language Models | arXiv | 2023-12-17 | Github | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | arXiv | 2023-12-01 | Github | Demo |
| Aligning Large Multimodal Models with Factually Augmented RLHF | arXiv | 2023-09-25 | Github | Demo |

Others

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| VCoder: Versatile Vision Encoders for Multimodal Large Language Models | arXiv | 2023-12-21 | Github | Local Demo |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | arXiv | 2023-12-07 | Github | - |
| Planting a SEED of Vision in Large Language Model | arXiv | 2023-07-16 | Github | - |
| Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Github | - |
| Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
| Generating Images with Multimodal Language Models | arXiv | 2023-05-26 | Github | - |
| On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs | ICML | 2023-01-31 | Github | Demo |

Awesome Datasets

Datasets of Pre-Training for Alignment

| Name | Paper | Type | Modalities |
| :--- | :--- | :---: | :---: |
| ShareGPT4V | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Caption | Image-Text |
| AS-1B | The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | Hybrid | Image-Text |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | Caption | Video-Text |
| MS-COCO | Microsoft COCO: Common Objects in Context | Caption | Image-Text |
| SBU Captions | Im2Text: Describing Images Using 1 Million Captioned Photographs | Caption | Image-Text |
| Conceptual Captions | Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning | Caption | Image-Text |
| LAION-400M | LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs | Caption | Image-Text |
| VG Captions | Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | Caption | Image-Text |
| Flickr30k | Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models | Caption | Image-Text |
| AI-Caps | AI Challenger: A Large-scale Dataset for Going Deeper in Image Understanding | Caption | Image-Text |
| Wukong Captions | Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark | Caption | Image-Text |
| GRIT | Kosmos-2: Grounding Multimodal Large Language Models to the World | Caption | Image-Text-Bounding-Box |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Caption | Video-Text |
| MSR-VTT | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | Caption | Video-Text |
| Webvid10M | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Caption | Video-Text |
| WavCaps | WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research | Caption | Audio-Text |
| AISHELL-1 | AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline | ASR | Audio-Text |
| AISHELL-2 | AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | ASR | Audio-Text |
| VSDial-CN | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | ASR | Image-Audio-Text |

Datasets of Multimodal Instruction Tuning

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A large-scale 3D instruction tuning dataset |
| LVIS-Instruct4V | To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | Link | A visual instruction dataset via self-instruction from GPT-4V |
| ComVint | What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | Link | A synthetic instruction dataset for complex visual reasoning |
| SparklesDialogue | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions, augmenting the conversational competence of instruction-following LLMs across multiple images and dialogue turns |
| StableLLaVA | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | Link | A cheap and effective approach to collecting visual instruction tuning data |
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
| MGVLID | ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | - | A high-quality instruction-tuning dataset including image-text and region-text pairs |
| BuboGPT | BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Link | A high-quality instruction-tuning dataset including audio-text audio-caption data and audio-image-text localization data |
| SVIT | SVIT: Scaling up Visual Instruction Tuning | Link | A large-scale dataset with 4.2M informative visual instruction tuning samples, including conversations, detailed descriptions, complex reasoning, and referring QAs |
| mPLUG-DocOwl | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Link | An instruction tuning dataset featuring a wide range of visual-text understanding tasks, including OCR-free document understanding |
| PF-1M | Visual Instruction Tuning with Polite Flamingo | Link | A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo |
| LLaVAR | LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | Link | A visual instruction-tuning dataset for text-rich image understanding |
| MotionGPT | MotionGPT: Human Motion as a Foreign Language | Link | An instruction-tuning dataset covering multiple human motion-related tasks |
| LRV-Instruction | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A visual instruction tuning dataset for addressing the hallucination issue |
| Macaw-LLM | Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Link | A large-scale multi-modal instruction dataset with multi-turn dialogues |
| LAMM-Dataset | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A comprehensive multi-modal instruction tuning dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A 100K high-quality video instruction dataset |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | Multimodal in-context instruction tuning dataset |
| M3IT | M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | Link | A large-scale, broad-coverage multimodal instruction tuning dataset |
| LLaVA-Med | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | Coming soon | A large-scale, broad-coverage biomedical instruction-following dataset |
| GPT4Tools | GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Link | Tool-related instruction datasets |
| MULTIS | ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | Coming soon | A multimodal instruction tuning dataset covering 16 multimodal tasks |
| DetGPT | DetGPT: Detect What You Need via Reasoning | Link | An instruction-tuning dataset with 5,000 images and around 30,000 query-answer pairs |
| PMC-VQA | PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | Coming soon | A large-scale medical visual question-answering dataset |
| VideoChat | VideoChat: Chat-Centric Video Understanding | Link | A video-centric multimodal instruction dataset |
| X-LLM | X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | Link | A Chinese multimodal instruction dataset |
| LMEye | LMEye: An Interactive Perception Network for Large Language Models | Link | A multi-modal instruction-tuning dataset |
| cc-sbu-align | MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | Link | A multimodal aligned dataset for improving the model's usability and generation fluency |
| LLaVA-Instruct-150K | Visual Instruction Tuning | Link | Multimodal instruction-following data generated by GPT |
| MultiInstruct | MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | Link | The first multimodal instruction tuning benchmark dataset |
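Many of the instruction-tuning datasets above share a conversation-style record format. The sketch below illustrates that shape and a minimal validity check; the field names (`id`, `image`, `conversations`, `from`, `value`) follow the LLaVA-style convention, but individual datasets differ, so treat this as an illustrative assumption rather than a universal schema.

```python
# Illustrative sketch of a LLaVA-style instruction-tuning record.
# Field names are assumptions based on common practice; check each
# dataset's documentation for its actual schema.
sample = {
    "id": "000000001",
    "image": "coco/train2017/000000001.jpg",  # hypothetical path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the picture?"},
        {"from": "gpt", "value": "A dog playing in a park."},
    ],
}

def is_valid_record(record: dict) -> bool:
    """Check that a record has the fields a typical loader would expect."""
    turns = record.get("conversations", [])
    return (
        isinstance(record.get("id"), str)
        and bool(turns)
        and all({"from", "value"} <= set(turn) for turn in turns)
    )

print(is_valid_record(sample))  # True
```

A filter like this is handy when mixing several of the datasets above into one training corpus, since malformed or empty conversations would otherwise surface as loader errors mid-training.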

Datasets of In-Context Learning

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| MIC | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Link | A manually constructed instruction tuning dataset including interleaved text-image inputs, inter-related multi-image inputs, and multimodal in-context learning inputs |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Link | A multimodal in-context instruction dataset |

Datasets of Multimodal Chain-of-Thought

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| EMER | Explainable Multimodal Emotion Reasoning | Coming soon | A benchmark dataset for the explainable emotion reasoning task |
| EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | A large-scale embodied planning dataset |
| VIP | Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
| ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | A large-scale multi-choice dataset featuring multimodal science questions across diverse domains |

Datasets of Multimodal RLHF

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| VLFeedback | Silkie: Preference Distillation for Large Visual Language Models | Link | A vision-language feedback dataset annotated by AI |

Benchmarks for Evaluation

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| M3DBench | M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts | Link | A 3D-centric benchmark |
| MLLM-Bench | MLLM-Bench, Evaluating Multi-modal LLMs using GPT-4V | Link | GPT-4V evaluation with per-sample criteria |
| BenchLMM | BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Link | A benchmark for assessing robustness to different image styles |
| MMC-Benchmark | MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | Link | A comprehensive human-annotated benchmark with distinct tasks evaluating reasoning capabilities over charts |
| MVBench | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Link | A comprehensive multimodal benchmark for video understanding |
| Bingo | Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges | Link | A benchmark for hallucination evaluation that focuses on two common types |
| MagnifierBench | OtterHD: A High-Resolution Multi-modality Model | Link | A benchmark designed to probe models' fine-grained perception ability |
| HallusionBench | HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models | Link | An image-context reasoning benchmark for hallucination evaluation |
| MMHal-Bench | Aligning Large Multimodal Models with Factually Augmented RLHF | Link | A benchmark for hallucination evaluation |
| MathVista | MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models | Link | A benchmark that challenges both visual and math reasoning capabilities |
| SparklesEval | ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | Link | A GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns, based on three distinct criteria |
| ISEKAI | Link-Context Learning for Multimodal LLMs | Link | A benchmark consisting exclusively of unseen generated image-label pairs, designed for link-context learning |
| M-HalDetect | Detecting and Preventing Hallucinations in Large Vision Language Models | Coming soon | A dataset used to train and benchmark models for hallucination detection and prevention |
| I4 | Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions | Link | A benchmark for comprehensively evaluating instruction-following ability on complicated interleaved vision-language instructions |
| SciGraphQA | SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs | Link | A large-scale chart visual question-answering dataset |
| MM-Vet | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | Link | An evaluation benchmark that examines large multimodal models on complicated multimodal tasks |
| SEED-Bench | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Link | A benchmark for evaluating generative comprehension in MLLMs |
| MMBench | MMBench: Is Your Multi-modal Model an All-around Player? | Link | A systematically designed objective benchmark for robustly evaluating the various abilities of vision-language models |
| Lynx | What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | Link | A comprehensive evaluation benchmark including both image and video tasks |
| GAVIE | Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Link | A benchmark for evaluating hallucination and instruction-following ability |
| MME | MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Link | A comprehensive MLLM evaluation benchmark |
| LVLM-eHub | LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models | Link | An evaluation platform for MLLMs |
| LAMM-Benchmark | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Link | A benchmark for evaluating the quantitative performance of MLLMs on various 2D/3D vision tasks |
| M3Exam | M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | Link | A multilingual, multimodal, multilevel benchmark for evaluating MLLMs |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | Link | A dataset for evaluation across multiple capabilities |

Others

| Name | Paper | Link | Notes |
|:---|:---|:---:|:---|
| IMAD | IMAD: IMage-Augmented multi-modal Dialogue | Link | A multimodal dialogue dataset |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Link | A quantitative evaluation framework for video-based dialogue models |
| CLEVR-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A synthetic multimodal fine-tuning dataset for learning to reject human instructions |
| Fruit-ATVC | Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | Link | A manually photographed multimodal fine-tuning dataset for learning to reject human instructions |
| InfoSeek | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | Link | A VQA dataset that focuses on asking information-seeking questions |
| OVEN | Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | Link | A dataset focused on recognizing Wikipedia visual entities in images from the wild |
