Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Temporal grounding of natural-language queries in hour-long videos remains underexplored, and existing open Video-LLMs perform poorly on it because of insufficient global search capability. This work decouples the task into two stages, retrieval followed by localization, in a “retrieve-then-ground” paradigm, and introduces ExtremeWhenBench, the first open benchmark for hour-scale video grounding with open-form queries. Systematic evaluation and a failure taxonomy attribute 85% of failures to search. The retrieve-then-ground hybrid recovers 6.7× over the monolithic Video-LLM, exposing a fundamental limitation of current open Video-LLMs in long-form video understanding.
📝 Abstract
Temporal grounding, i.e., returning the interval $[t_s, t_e]$ for a natural-language query over a video, is the language interface to long-form video, yet it has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour scale the binding constraint is search, not recognition: given a natural-language query, Video-LLMs are bottlenecked not by localizing a nearby event but by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos; mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses, while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7× over the monolithic Video-LLM, mirroring retrieve-then-read in open-domain QA.
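The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the per-frame query-relevance scores, the sliding-window retriever, and the `ground_fn` callable (a stand-in for a Video-LLM run on a short clip) are all hypothetical names and choices introduced here. The sketch assumes stage 1 picks the fixed-length window with the highest summed frame score, and stage 2 returns an interval relative to that window, which is then shifted back into global video time.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # (t_s, t_e) in seconds


def retrieve_window(frame_scores: List[float], fps: float, window_s: float) -> Interval:
    """Stage 1 (search): return the window with the highest summed frame score.

    frame_scores is an assumed per-frame query-relevance score, e.g. from
    some text-to-frame similarity model (hypothetical, not from the paper).
    """
    if not frame_scores:
        raise ValueError("frame_scores must be non-empty")
    # Window length in frames, kept within [1, number of frames].
    win = max(1, min(len(frame_scores), round(window_s * fps)))
    current = sum(frame_scores[:win])
    best_sum, best_start = current, 0
    # Slide the window one frame at a time, updating the sum incrementally.
    for start in range(1, len(frame_scores) - win + 1):
        current += frame_scores[start + win - 1] - frame_scores[start - 1]
        if current > best_sum:  # strict: ties keep the earliest window
            best_sum, best_start = current, start
    return best_start / fps, (best_start + win) / fps


def retrieve_then_ground(
    frame_scores: List[float],
    fps: float,
    window_s: float,
    ground_fn: Callable[[float, float], Interval],
) -> Interval:
    """Stage 2 (localization): ground inside the retrieved window only.

    ground_fn(w_start, w_end) is a hypothetical grounder that returns an
    interval relative to the start of the window.
    """
    w_start, w_end = retrieve_window(frame_scores, fps, window_s)
    local_s, local_e = ground_fn(w_start, w_end)
    # Clamp the local prediction to the window, then shift to global time.
    length = w_end - w_start
    local_s = min(max(local_s, 0.0), length)
    local_e = min(max(local_e, local_s), length)
    return w_start + local_s, w_start + local_e
```

For example, with one score per second over a 100-second video where only seconds 60 to 69 are relevant, a 20-second window is retrieved first and the grounder only has to localize within it. The design point is that the grounder never sees the full video, so its error is bounded by the window as long as retrieval succeeds.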
Problem

Research questions and friction points this paper is trying to address.

Temporal Grounding
Long-form Video
Natural Language Query
Video Search
Hour-scale Video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Grounding
Long-form Video
Search Problem
Video-LLM
retrieve-then-ground