Skip to content

VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs

Notifications You must be signed in to change notification settings

joez17/VideoNIAH

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs (VideoNIAH)

VideoQA Multi-Modal VNBench
Gemini GPT-4V GPT-4o


🔥 News

  • 2024.06.13 🌟 We are very proud to launch VideoNIAH, a scalable synthetic method for benchmarking video MLLMs, and VNBench, a comprehensive video synthetic benchmark!

👀 Overview

We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation. VideoNIAH decouples test video content from their query-responses by inserting unrelated image/text 'needles' into original videos. It generates annotations solely from these needles, ensuring diversity in video sources and a variety of query-responses. Additionally, by inserting multiple needles, VideoNIAH rigorously evaluates the temporal understanding capabilities of models.

We utilize VideoNIAH to compile a video benchmark VNBench, including tasks such as retrieval, ordering, and counting. VNBench contains 1350 samples in total. VNBench can efficiently evaluate the fine-grained understanding ability and spatio-temporal modeling ability of a video model, while also supporting the long-context evaluation.

VideoNIAH is a simple yet highly scalable benchmark construction framework, and we believe it will inspire future video benchmark works!

🔍 Dataset

Download the raw videos in VNBench from the google drive link. Download the annotation of VNBench from the huggingface link License:

VNBench is only used for academic research. Commercial use in any form is prohibited.
The copyright of all videos belongs to the video owners.

🔮 Evaluation Pipeline

Prompt:

The common prompt used in our evaluation follows this format:

<QUESTION>
A. <OPTION1>
B. <OPTION2>
C. <OPTION3>
D. <OPTION4>
Answer with the option's letter from the given choices directly.

Evaluation:

We recommend you to save the inference result in the format as example_result.jsonl. Once you have prepared the model responses in this format, please execute our evaluation script eval.py, and you will get the accuracy scores.

python eval.py \
    --path $RESULTS_FILE

If you want to use GPT-3.5 for evaluation, please use the script wo provided gpt_judge.py.

python gpt_judge.py \
    --input_file $INPUT_FILE \
    --output_file $OUTPUT_FILE 

📈 Experimental Results

  • Evaluation results of different Video MLLMs. Please visit the leaderboard for more details.

Citation

If you find our work helpful for your research, please consider citing our work.

@article{zhao2024videoniah,
  title={Needle In A Video Haystack: A Scalable  Synthetic Framework for Benchmarking Video MLLMs},
  author={Zhao, Zijia and Lu, Haoyu and Huo, Yuqi and Du, Yifan and Yue, Tongtian and Guo, Longteng and Wang, Bingning and Chen, Weipeng and Liu, Jing},
  journal={arXiv preprint},
  year={2024}
}

About

VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages