🤖 AI Summary
Existing benchmarks inadequately assess multimodal models’ comprehension of very long videos (mean duration: 52.59 minutes), and in particular lack systematic evaluation of higher-order narrative cognition. To address this, we introduce InfiniBench, a comprehensive benchmark for very long video understanding comprising 108.2K high-quality QA pairs grounded in movies and TV shows. Questions target nine human-level skills (e.g., movie spoiler questions requiring critical thinking), are human-centric by design, and adopt both multiple-choice and open-ended formats. InfiniBench enables fine-grained, skill-level evaluation of mainstream large multimodal models (LMMs), including GPT-4o, Gemini 1.5 Flash, and open-source LMMs. Empirical results reveal substantial limitations: GPT-4o and Gemini 1.5 Flash achieve only 49.16% and 42.72% average accuracy, with average scores of 3.22 and 2.71 (out of 5), respectively.
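To make the two question formats concrete, here is a minimal, hypothetical sketch of what individual QA entries might look like; all field names and values are illustrative assumptions, not InfiniBench's actual data schema.

```python
# Hypothetical QA entries illustrating the two question formats.
# Field names and values are assumed for illustration only; they do not
# reflect InfiniBench's actual schema.
mcq_example = {
    "video": "movie_0001",           # hour-scale source video (assumed ID)
    "skill": "spoiler_reasoning",    # one of the nine targeted skills
    "type": "mcq",
    "question": "Why does the protagonist hide the letter?",
    "choices": ["A ...", "B ...", "C ...", "D ..."],
    "answer": "B",                   # scored via accuracy
}

open_ended_example = {
    "video": "tv_show_0042",
    "skill": "summarization",
    "type": "open_ended",
    "question": "Summarize the main conflict across the episode.",
    "reference": "The two leads clash over ...",  # judged on a 0-5 scale
}
```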
📝 Abstract
Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench, a comprehensive benchmark for very long video understanding that presents: (1) the longest video duration, averaging 52.59 minutes per video; (2) the largest number of question-answer pairs, 108.2K; (3) question diversity, examining nine different skills and including both multiple-choice and open-ended questions; and (4) a human-centric design, as the video sources come from movies and daily TV shows, with human-level question types such as movie spoiler questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large Multi-Modality Models (LMMs) on each skill, including commercial models such as GPT-4o and Gemini 1.5 Flash as well as open-source models. The evaluation reveals significant challenges: even leading models like GPT-4o and Gemini 1.5 Flash struggle with long video understanding, achieving average accuracies of just 49.16% and 42.72% and average scores of 3.22 and 2.71 out of 5, respectively. We hope this benchmark will stimulate the LMM community toward long-video, human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/
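The reported metrics pair multiple-choice accuracy with open-ended scores on a 0-5 scale. Below is a minimal sketch of how such per-skill metrics could be aggregated; the record layout matches the hypothetical entries above and is assumed for illustration, not the benchmark's actual evaluation code.

```python
# Sketch of per-skill metric aggregation under the assumed record layout:
# MCQ items carry a boolean "correct"; open-ended items carry a "score"
# in [0, 5] (e.g., assigned by a judge model).
from collections import defaultdict

def aggregate_metrics(results):
    """Return (per-skill accuracy in %, per-skill mean score out of 5)."""
    mcq_hits = defaultdict(list)
    oe_scores = defaultdict(list)
    for r in results:
        if r["type"] == "mcq":
            mcq_hits[r["skill"]].append(r["correct"])
        else:
            oe_scores[r["skill"]].append(r["score"])
    accuracy = {s: 100.0 * sum(v) / len(v) for s, v in mcq_hits.items()}
    mean_score = {s: sum(v) / len(v) for s, v in oe_scores.items()}
    return accuracy, mean_score

# Example: one correct MCQ answer and one judged open-ended answer.
results = [
    {"skill": "spoiler_reasoning", "type": "mcq", "correct": True},
    {"skill": "summarization", "type": "open_ended", "score": 3.0},
]
acc, score = aggregate_metrics(results)
print(acc)    # {'spoiler_reasoning': 100.0}
print(score)  # {'summarization': 3.0}
```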