SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

2026-06-02

AI Summary
Existing video reasoning benchmarks are largely confined to short, localized clips and therefore cannot evaluate a model's capacity for long-horizon, cross-episode multi-hop reasoning over an entire television series. To address this gap, this work proposes SagaQA, the first fine-grained multi-hop reasoning benchmark centered on full-length TV series, emphasizing high-level multimodal comprehension of cross-episode event dependencies and narrative structure. The authors also compare three classes of agentic planning strategies: parallel, sequential, and hybrid. Results on SagaQA suggest that hybrid planners generate more coherent and complete reasoning plans and handle long-form television narrative understanding better than the other two classes.
๐Ÿ“ Abstract
We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narration and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes-Parallel, Sequential, and Hybrid planners-and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.
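The abstract names three planner classes but does not define them here. As a rough illustration only, one way to picture the distinction is to treat a reasoning plan as a set of sub-questions with dependencies. The `Step` structure, the `execution_waves` and `classify` helpers, and the classification rule below are all assumptions made for this sketch, not the paper's definitions or code.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One sub-question in a reasoning plan (hypothetical structure)."""
    name: str
    depends_on: list[str] = field(default_factory=list)


def execution_waves(plan: list[Step]) -> list[list[str]]:
    """Group steps into waves; steps in one wave have no unmet dependencies."""
    done: set[str] = set()
    remaining = list(plan)
    waves: list[list[str]] = []
    while remaining:
        ready = [s for s in remaining if all(d in done for d in s.depends_on)]
        if not ready:
            raise ValueError("plan has a cycle or a missing dependency")
        waves.append([s.name for s in ready])
        done.update(s.name for s in ready)
        remaining = [s for s in remaining if s.name not in done]
    return waves


def classify(plan: list[Step]) -> str:
    """Label a plan by the shape of its waves (illustrative rule only)."""
    waves = execution_waves(plan)
    if len(waves) == 1:
        return "parallel"      # every step is independent
    if all(len(w) == 1 for w in waves):
        return "sequential"    # each step waits on the previous one
    return "hybrid"            # independent lookups feed a later step


# Hypothetical cross-episode question: two independent lookups, then a join.
hybrid_plan = [
    Step("find_event_in_early_episode"),
    Step("find_event_in_late_episode"),
    Step("link_events", ["find_event_in_early_episode",
                         "find_event_in_late_episode"]),
]
print(classify(hybrid_plan))
```

Under this reading, a hybrid plan gathers evidence from different episodes independently and then chains a step that depends on both, which is one plausible shape for a cross-episode reasoning hop.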
Problem

Research questions and friction points this paper is trying to address.

multi-hop reasoning
long-form narrative understanding
TV series
video benchmark
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning
long-form video understanding
TV series narrative comprehension
agentic planning
hybrid planners