APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory and computational bottlenecks that hinder hour-long video understanding, this paper proposes a training-free hierarchical visual information retrieval framework. Methodologically, it introduces a dual-granularity retrieval mechanism built around pivot frames and pivot tokens, combining semantic expansion retrieval, multimodal confidence scoring, query-aware attention-driven token selection, and hierarchical visual compression to jointly preserve global semantics and fine-grained details. The framework is plug-and-play and requires no parameter tuning or retraining, sharply reducing computational overhead. Evaluated on the LongVideoBench and VideoMME benchmarks, it consistently outperforms both training-based and training-free methods, achieving state-of-the-art performance, and it scales to videos several hours long without compromising accuracy or requiring model adaptation.
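As a concrete illustration of the multimodal confidence scoring mentioned in the summary, the snippet below fuses a visual similarity and a caption similarity into a single per-frame confidence and keeps the top-k frames as pivots. It is a minimal sketch, not the paper's implementation: the embedding source, the linear fusion weight `alpha`, and the helper names (`cosine`, `pivot_frames`) are all assumptions made for illustration.

```python
# Minimal sketch of pivot-frame selection via multimodal confidence
# scoring. The linear fusion weight `alpha` and all helper names are
# illustrative assumptions, not APVR's actual code.
import numpy as np

def cosine(query: np.ndarray, rows: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    r = rows / np.linalg.norm(rows, axis=-1, keepdims=True)
    return r @ q

def pivot_frames(query_emb, frame_embs, caption_embs, k=8, alpha=0.5):
    """Fuse visual and caption similarity into a confidence score and
    return the indices of the top-k pivot frames in temporal order."""
    visual = cosine(query_emb, frame_embs)     # query vs. frame features
    textual = cosine(query_emb, caption_embs)  # query vs. per-frame captions
    confidence = alpha * visual + (1.0 - alpha) * textual
    return np.sort(np.argsort(confidence)[-k:])

# Toy usage: random vectors stand in for a CLIP-style encoder's output.
rng = np.random.default_rng(0)
query = rng.normal(size=512)
frames = rng.normal(size=(3600, 512))    # e.g. one feature per video second
captions = rng.normal(size=(3600, 512))
print(pivot_frames(query, frames, captions, k=8))
```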

📝 Abstract
Current video-based multimodal large language models struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory-wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware attention-driven token selection within the pivot frames. This dual-granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results that surpass not only training-free but also training-based approaches, while providing plug-and-play integration with existing MLLM architectures.
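The Pivot Token Retrieval step described in the abstract can be pictured as scoring each visual token by the attention mass it receives from the query and keeping a fixed budget of the highest-scoring tokens. The sketch below assumes single-head scaled dot-product attention and a flat token budget; both choices, and the function names, are illustrative rather than taken from the paper.

```python
# Minimal sketch of query-aware attention-driven token selection.
# Single-head attention and the fixed `budget` are assumptions.
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pivot_tokens(query_tokens: np.ndarray, visual_tokens: np.ndarray,
                 budget: int = 256) -> np.ndarray:
    """Rank visual tokens by the total attention they receive from the
    query tokens; return the `budget` best indices in original order."""
    d = query_tokens.shape[-1]
    attn = softmax(query_tokens @ visual_tokens.T / np.sqrt(d), axis=-1)
    importance = attn.sum(axis=0)          # attention mass per visual token
    return np.sort(np.argsort(importance)[-budget:])

# Toy usage: 16 query tokens against 8 pivot frames of 576 patch tokens.
rng = np.random.default_rng(1)
q = rng.normal(size=(16, 256))
v = rng.normal(size=(8 * 576, 256))
print(pivot_tokens(q, v, budget=256).shape)  # -> (256,)
```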
Problem

Research questions and friction points this paper is trying to address.

Addresses the limitations of current MLLMs on hour-level video understanding
Improves inefficient information extraction from long temporal sequences
Mitigates the computational and memory constraints of video-based models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical visual information retrieval framework
Semantic expansion and multi-modal confidence scoring (see the sketch after this list)
Query-aware attention-driven token selection
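The semantic expansion idea from the list above can be read as: rewrite the query into several related phrasings, embed each, and score a frame by its best match across all variants, so frames relevant only to a paraphrase are not missed. In the sketch below the expansion step is a hand-written stub standing in for whatever expansion model the paper uses, and the random-vector `embed` is a placeholder for a real text encoder; both are assumptions.

```python
# Minimal sketch of semantic expansion retrieval. `expand_query` and
# `embed` are placeholders (assumptions) for an LLM-based expander and
# a real text encoder respectively.
import numpy as np

def expand_query(query: str) -> list[str]:
    """Stub for the expansion step; a real system would ask an LLM."""
    return [query, f"a scene showing {query}", f"a close-up of {query}"]

def embed(texts: list[str], dim: int = 512, rng=None) -> np.ndarray:
    """Placeholder encoder returning random unit vectors per text."""
    if rng is None:
        rng = np.random.default_rng(0)
    e = rng.normal(size=(len(texts), dim))
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

def expanded_scores(query: str, frame_embs: np.ndarray, rng=None) -> np.ndarray:
    """Score each frame by its best cosine match over all expansions."""
    q = embed(expand_query(query), dim=frame_embs.shape[-1], rng=rng)
    f = frame_embs / np.linalg.norm(frame_embs, axis=-1, keepdims=True)
    return (f @ q.T).max(axis=-1)          # max over expansions, per frame

# Toy usage on one hour of per-second frame features.
rng = np.random.default_rng(2)
frames = rng.normal(size=(3600, 512))
print(expanded_scores("a man opens a red door", frames, rng=rng)[:5])
```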
Authors

Hong Gao
Zhejiang Normal University
Database · Internet of Things
Yiming Bao
ZTE Corporation
Xuezhan Tu
ZTE Corporation
Bin Zhong
ZTE Corporation
Minling Zhang
Southeast University