AI Summary
Existing video reasoning benchmarks are largely confined to understanding short, localized clips and thus fail to evaluate models' capacity for long-horizon, cross-episode multi-hop reasoning over entire television series. To address this gap, this work proposes SagaQA, the first fine-grained multi-hop reasoning benchmark centered on full-length TV dramas, emphasizing high-level multimodal comprehension of cross-episode event dependencies and narrative structures. We introduce a multi-agent planning framework to systematically compare parallel, sequential, and hybrid reasoning strategies. Experimental results demonstrate that the hybrid planner generates more coherent and complete reasoning chains, significantly outperforming existing approaches on long-form television narrative understanding tasks.
Abstract
We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narration and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (Parallel, Sequential, and Hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.