🤖 AI Summary
Existing video question-answering (VideoQA) benchmarks struggle to capture complex reasoning over both visual and dialogue cues, and most are not open-ended because free-form answers are difficult to evaluate. This work proposes MovieRecapsQA, an open-ended multimodal VideoQA benchmark built from movie recap videos, which pair synchronized visual content with textual summaries, and constructs approximately 8.2K question-answer pairs from them. According to the authors, it is the first open-ended VideoQA benchmark to supply explicit textual context for the input, which serves as the factual basis for reference-free answer verification. The benchmark categorizes questions by modality and type and provides videos of multiple lengths, enabling fine-grained model analysis. Evaluations of seven state-of-the-art multimodal large language models show that visual-only questions are the most challenging, that models default to textual inputs whenever they are available, and that extracting factually accurate information from video remains difficult for all models; notably, proprietary and open-source models perform comparably on video-dependent questions.
📝 Abstract
Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers. In this paper, we introduce MovieRecapsQA, a novel open-ended multimodal VideoQA benchmark created using movie recap videos, a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities. Using the recap summary, we generate $\approx 8.2$K question-answer (QA) pairs (aligned with movie subtitles) and provide the "facts" needed to verify an answer in a reference-free manner. To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context of the input (video and/or text), which we use for evaluation. Our benchmark provides videos of multiple lengths (i.e., recap segments and movie segments) and categorizations of questions (by modality and type) to enable fine-grained analysis. We evaluate the performance of seven state-of-the-art multimodal large language models (MLLMs) on our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.