Narrative Aligned Long Form Video Question Answering

📅 2026-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing long-form video question answering benchmarks, which predominantly rely on local cues and fail to evaluate models' capacity for deep narrative reasoning, such as tracking character intentions, linking distant events, and reconstructing causal chains across entire films. To this end, we introduce NA-VQA, a benchmark comprising 88 full-length movies and 4.4K open-ended questions requiring cross-scene information integration. We further propose Video-NaRA, a framework that constructs event-level narrative chains and stores them in structured memory to support long-range reasoning. By incorporating multi-span evidence annotation and a narrative-centric inference mechanism, Video-NaRA moves beyond shallow matching paradigms. Experimental results demonstrate that our approach significantly outperforms current methods on NA-VQA, achieving up to a 3% performance gain on questions involving distant evidence and substantially enhancing comprehension of complex narrative structures.
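The multi-span evidence annotation described above labels how far apart a question's supporting spans lie in the film. A minimal sketch of such a distance labeler is below; the function name and the Short/Medium/Far thresholds are illustrative assumptions, since NA-VQA's actual boundaries are not given here.

```python
def label_span_distance(span_a, span_b, short_max=60.0, medium_max=600.0):
    """Label the temporal gap between two evidence spans.

    Each span is a (start, end) tuple in seconds. The threshold values
    are hypothetical placeholders, not NA-VQA's published settings.
    """
    # Gap = distance between the later span's start and the earlier
    # span's end; overlapping spans get a gap of 0.
    gap = max(0.0, max(span_a[0], span_b[0]) - min(span_a[1], span_b[1]))
    if gap <= short_max:
        return "Short"
    if gap <= medium_max:
        return "Medium"
    return "Far"
```

For example, two spans separated by roughly an hour of film would be labeled "Far" under these placeholder thresholds.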

📝 Abstract
Recent progress in multimodal large language models (MLLMs) has led to a surge of benchmarks for long-video reasoning. However, most existing benchmarks rely on localized cues and fail to capture narrative reasoning, the ability to track intentions, connect distant events, and reconstruct causal chains across an entire movie. We introduce NA-VQA, a benchmark designed to evaluate deep temporal and narrative reasoning in long-form videos. NA-VQA contains 88 full-length movies and 4.4K open-ended question-answer pairs, each grounded in multiple evidence spans labeled as Short, Medium, or Far to assess long-range dependencies. By requiring generative, multi-scene answers, NA-VQA tests whether models can integrate dispersed narrative information rather than rely on shallow pattern matching. To address the limitations of existing approaches, we propose Video-NaRA, a narrative-centric framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Extensive experiments show that state-of-the-art MLLMs perform poorly on questions requiring far-range evidence, highlighting the need for explicit narrative modeling. Video-NaRA improves long-range reasoning performance by up to 3 percent, demonstrating its effectiveness in handling complex narrative structures. We will release NA-VQA upon publication.
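The abstract describes a structured memory that stores event-level narrative chains and is queried during reasoning. A minimal sketch of that data structure is below, assuming a simple entity-indexed design; the class names and retrieval logic are hypothetical illustrations, not Video-NaRA's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    """One narrative event: where it occurs and who it involves."""
    scene_id: int
    description: str
    entities: frozenset

@dataclass
class NarrativeMemory:
    """Entity-indexed chains of events, queried at reasoning time."""
    chains: dict = field(default_factory=dict)  # entity -> list[Event]

    def add_event(self, event):
        # Extend the chain of every entity the event mentions.
        for ent in event.entities:
            self.chains.setdefault(ent, []).append(event)

    def retrieve(self, query_entities, max_events=5):
        # Collect events linked to the queried entities, then
        # deduplicate and return them in chronological order so a
        # downstream model sees a coherent, ordered narrative chain.
        hits = []
        for ent in query_entities:
            hits.extend(self.chains.get(ent, []))
        unique = {id(e): e for e in hits}.values()
        return sorted(unique, key=lambda e: e.scene_id)[:max_events]
```

Keeping chains indexed by entity lets a question about a character pull in that character's distant events directly, which is the kind of far-range evidence integration the benchmark targets.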
Problem

Research questions and friction points this paper is trying to address.

narrative reasoning
long-form video
temporal dependency
causal chain
video question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

narrative reasoning
long-form video QA
event-level chains
structured memory
long-range dependencies