InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding

📅 2024-06-28
🏛️ arXiv.org
📈 Citations: 14
Influential: 1
📄 PDF
🤖 AI Summary
Existing benchmarks inadequately assess multimodal models’ comprehension of ultra-long videos (mean duration: 52.59 minutes), particularly lacking systematic evaluation of higher-order narrative cognition. To address this, we introduce InfiniBench—a comprehensive benchmark for hour-scale video understanding—comprising 108.2K high-quality QA pairs grounded in films and TV series. Questions target nine human cognitive skills (e.g., spoiler reasoning), are human-centered in design, and adopt both multiple-choice and open-ended generation formats. InfiniBench enables fine-grained skill-level assessment across mainstream large multimodal models (LMMs), including GPT-4o, Gemini 1.5 Flash, and open-source LMMs. Empirical results reveal severe limitations: GPT-4o and Gemini 1.5 Flash achieve only 49.16% and 42.72% average accuracy, with overall scores of 3.22 and 2.71 (out of 5), respectively.

📝 Abstract
Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents: 1) the longest video duration, averaging 52.59 minutes per video; 2) the largest number of question-answer pairs, 108.2K; 3) diversity in questions that examine nine different skills and include both multiple-choice and open-ended questions; 4) a human-centric design, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as open-source models. The evaluation shows significant challenges in our benchmark. Our findings reveal that even leading AI models like GPT-4o and Gemini 1.5 Flash struggle to achieve high performance in long video understanding, with average accuracies of just 49.16% and 42.72%, and average scores of 3.22 and 2.71 out of 5, respectively. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-modal models for long video understanding
Assessing cognitive skills in narratively complex inputs
Addressing reliance on pre-trained knowledge over visual understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for long video understanding
Largest set of question-answer pairs
Diverse skills spanning grounding and reasoning
Kirolos Ataallah
King Abdullah University of Science and Technology
Chenhui Gou
3rd-year PhD candidate, Monash University
Eslam Abdelrahman
King Abdullah University of Science and Technology
Khushbu Pahwa
Rice University
Jian Ding
King Abdullah University of Science and Technology
Mohamed Elhoseiny
King Abdullah University of Science and Technology