SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video reasoning benchmarks are largely confined to short, localized clips and therefore do not evaluate a model's capacity for long-horizon, cross-episode multi-hop reasoning over an entire television series. To address this gap, this work proposes SagaQA, a fine-grained multi-hop reasoning benchmark built on full-length TV series that emphasizes high-level multimodal comprehension of cross-episode event dependencies and narrative structure. The authors also compare three classes of agentic planning strategies: parallel, sequential, and hybrid. Results on SagaQA suggest that hybrid planners produce more coherent and complete reasoning plans and show stronger long-form television narrative understanding than the parallel and sequential alternatives.

📝 Abstract

We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narrative and its progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (Parallel, Sequential, and Hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.
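
The abstract names three planner classes but does not define them. As a rough illustration only, the sketch below assumes one common reading: a parallel plan has independent sub-questions, a sequential plan is a chain in which each step depends on the previous one, and a hybrid plan mixes the two as a dependency graph. The `Step` class, the `execution_waves` helper, and the step names are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One hypothetical sub-question in a reasoning plan (not from the paper)."""
    name: str
    depends_on: tuple[str, ...] = ()


def execution_waves(plan: list[Step]) -> list[list[str]]:
    """Group steps into waves; steps in the same wave have no unmet
    dependencies and could, in principle, be answered concurrently."""
    done: set[str] = set()
    remaining = list(plan)
    waves: list[list[str]] = []
    while remaining:
        ready = [s for s in remaining if all(d in done for d in s.depends_on)]
        if not ready:
            raise ValueError("plan has a cycle or a missing dependency")
        waves.append([s.name for s in ready])
        done.update(s.name for s in ready)
        remaining = [s for s in remaining if s.name not in done]
    return waves


# Assumed shapes of the three plan types, using placeholder step names.
parallel = [Step("a"), Step("b"), Step("c")]
sequential = [Step("a"), Step("b", ("a",)), Step("c", ("b",))]
hybrid = [Step("a"), Step("b"), Step("c", ("a", "b"))]

for label, plan in [("parallel", parallel), ("sequential", sequential), ("hybrid", hybrid)]:
    print(label, execution_waves(plan))
```

Under these assumptions the parallel plan resolves in a single wave, the sequential plan needs one wave per step, and the hybrid plan answers the two independent steps first and then combines them, which is the shape a cross-episode question would take if evidence from two episodes must be gathered before a final inference.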

Problem

Research questions and friction points this paper is trying to address.

multi-hop reasoning

long-form narrative understanding

TV series

video benchmark

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-hop reasoning

long-form video understanding

TV series narrative comprehension