🤖 AI Summary
Existing benchmarks inadequately assess multimodal models’ comprehension of very long videos (mean duration: 52.59 minutes), and in particular lack systematic evaluation of higher-order narrative cognition. To address this, we introduce InfiniBench, a comprehensive benchmark for very long video understanding comprising 108.2K high-quality QA pairs grounded in movies and TV shows. Questions target nine human-level skills (e.g., movie spoiler questions requiring critical thinking), are human-centric by design, and adopt both multiple-choice and open-ended formats. InfiniBench enables fine-grained, skill-level evaluation of mainstream large multimodal models (LMMs), including GPT-4o, Gemini 1.5 Flash, and open-source LMMs. Empirical results reveal substantial limitations: GPT-4o and Gemini 1.5 Flash achieve only 49.16% and 42.72% average accuracy, with average scores of 3.22 and 2.71 (out of 5), respectively.
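To make the two question formats concrete, here is a minimal, hypothetical sketch of what individual QA entries might look like; all field names and values are illustrative assumptions, not InfiniBench's actual data schema.

```python
# Hypothetical QA entries illustrating the two question formats.
# Field names and values are assumed for illustration only; they do not
# reflect InfiniBench's actual schema.
mcq_example = {
    "video": "movie_0001",           # hour-scale source video (assumed ID)
    "skill": "spoiler_reasoning",    # one of the nine targeted skills
    "type": "mcq",
    "question": "Why does the protagonist hide the letter?",
    "choices": ["A ...", "B ...", "C ...", "D ..."],
    "answer": "B",                   # scored via accuracy
}

open_ended_example = {
    "video": "tv_show_0042",
    "skill": "summarization",
    "type": "open_ended",
    "question": "Summarize the main conflict across the episode.",
    "reference": "The two leads clash over ...",  # judged on a 0-5 scale
}
```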
📝 Abstract
Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding that presents: (1) the longest video duration, averaging 52.59 minutes per video; (2) the largest number of question-answer pairs, 108.2K; (3) question diversity, examining nine different skills and including both multiple-choice and open-ended questions; and (4) a human-centric design, as the video sources come from movies and daily TV shows, with human-level question types such as movie spoiler questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as open-source models. The evaluation reveals significant challenges: even leading models like GPT-4o and Gemini 1.5 Flash struggle with long video understanding, achieving average accuracies of just 49.16% and 42.72% and average scores of 3.22 and 2.71 out of 5, respectively. We hope this benchmark will stimulate the LMM community toward long-video, human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/
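The reported metrics pair multiple-choice accuracy with open-ended scores on a 0-5 scale. Below is a minimal sketch of how such per-skill metrics could be aggregated; the record layout matches the hypothetical entries above and is assumed for illustration, not the benchmark's actual evaluation code.

```python
# Sketch of per-skill metric aggregation under the assumed record layout:
# MCQ items carry a boolean "correct"; open-ended items carry a "score"
# in [0, 5] (e.g., assigned by a judge model).
from collections import defaultdict

def aggregate_metrics(results):
    """Return (per-skill accuracy in %, per-skill mean score out of 5)."""
    mcq_hits = defaultdict(list)
    oe_scores = defaultdict(list)
    for r in results:
        if r["type"] == "mcq":
            mcq_hits[r["skill"]].append(r["correct"])
        else:
            oe_scores[r["skill"]].append(r["score"])
    accuracy = {s: 100.0 * sum(v) / len(v) for s, v in mcq_hits.items()}
    mean_score = {s: sum(v) / len(v) for s, v in oe_scores.items()}
    return accuracy, mean_score

# Example: one correct MCQ answer and one judged open-ended answer.
results = [
    {"skill": "spoiler_reasoning", "type": "mcq", "correct": True},
    {"skill": "summarization", "type": "open_ended", "score": 3.0},
]
acc, score = aggregate_metrics(results)
print(acc)    # {'spoiler_reasoning': 100.0}
print(score)  # {'summarization': 3.0}
```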