Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual challenges of limited context length in Large Multimodal Models (LMMs) and the high computational cost of dense frame sampling in long-video understanding, this paper proposes a query-aware adaptive frame selection framework. The method distinguishes global queries, which require holistic semantic understanding, from localized queries, which demand fine-grained temporal localization, and handles them with training-free uniform sampling and a lightweight relevance-based frame extraction mechanism, respectively, avoiding generic, computationally expensive search strategies. Crucially, the framework requires no model fine-tuning and dynamically routes queries by type, achieving high localization accuracy at low computational overhead. Evaluated on three mainstream long-form video understanding benchmarks, the approach consistently outperforms existing state-of-the-art methods even when processing only 256 input frames, demonstrating both efficiency and strong generalization across diverse query types and datasets.

📝 Abstract
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
Problem

Research questions and friction points this paper is trying to address.

Limited LMM context length and the prohibitive cost of dense video tokens constrain long-form video understanding
Existing query-aware frame selection methods incur significant computational overhead
Whether complex, query-aware search is actually necessary for every query type
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive frame selection driven by a validated query typology (global vs. localized)
Uniform sampling for global queries; a specialized query-relevant extraction pipeline for localized queries
Training-free framework that robustly improves LMM performance on long videos, even at 256 input frames
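The routing idea behind DIG can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the query type has already been identified and that a per-frame relevance score is available for localized queries (how DIG computes these is not specified here); all function names are hypothetical.

```python
def uniform_sample(num_frames, budget):
    """Pick `budget` evenly spaced frame indices from a `num_frames`-long video."""
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]

def relevance_select(frame_scores, budget):
    """Keep the `budget` frames with the highest query-relevance scores,
    returned in temporal order."""
    top = sorted(range(len(frame_scores)),
                 key=lambda i: frame_scores[i], reverse=True)[:budget]
    return sorted(top)

def select_frames(query_type, num_frames, budget, frame_scores=None):
    """Route by query type: cheap uniform sampling for global queries,
    relevance-based selection for localized queries."""
    if query_type == "global":
        return uniform_sample(num_frames, budget)
    # Localized query: requires precomputed per-frame relevance scores.
    assert frame_scores is not None and len(frame_scores) == num_frames
    return relevance_select(frame_scores, budget)
```

For example, a global query over a 1000-frame video with a 4-frame budget returns evenly spaced indices, while a localized query keeps only the highest-scoring frames in their original temporal order.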
Jialuo Li
Tsinghua University
Computer Vision · Machine Learning · Multimodality · Reinforcement Learning
Bin Li
Microsoft Research Asia
Jiahao Li
Microsoft Research Asia
Yan Lu
Microsoft Research Asia