🤖 AI Summary
To address the memory and computational bottlenecks that limit hour-long video understanding, this paper proposes APVR, a training-free hierarchical visual information retrieval framework. It introduces a dual-granularity retrieval mechanism that works at two levels: pivot frames are identified through semantic expansion and multimodal confidence scoring, and pivot tokens are then chosen within those frames through query-aware, attention-driven token selection, combined with hierarchical visual compression. Together these steps preserve both global semantics and fine-grained details. The framework is plug-and-play and requires no retraining or fine-tuning of the underlying model, which reduces computational overhead. On the LongVideoBench and VideoMME benchmarks, it is reported to outperform both training-based and training-free methods, achieving state-of-the-art results while making hour-long videos tractable without model adaptation.
📝 Abstract
Current video-based multimodal large language models (MLLMs) struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multimodal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware, attention-driven token selection within the pivot frames. This dual-granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results among both training-free and training-based approaches, while providing plug-and-play integration with existing MLLM architectures.
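The two-stage flow described in the abstract (select pivot frames first, then select pivot tokens inside them) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring functions, so cosine similarity stands in for the multimodal confidence score, a list of query embeddings stands in for semantic expansion, and scaled dot-product attention from a single query vector stands in for the query-aware token attention. All function names and parameters (`select_pivot_frames`, `select_pivot_tokens`, `retrieve`, `k_frames`, `k_tokens`) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_pivot_frames(frame_embs, query_embs, k_frames):
    """Stage 1 (assumed scoring): score each frame by its best cosine match
    to any of the expanded query embeddings, keep the top k_frames,
    and return their indices in temporal order."""
    scores = [max(cosine(f, q) for q in query_embs) for f in frame_embs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(top[:k_frames])


def select_pivot_tokens(token_embs, query_emb, k_tokens):
    """Stage 2 (assumed scoring): compute scaled dot-product attention from
    the query to each visual token and keep the k_tokens most attended,
    returned in their original order."""
    scale = math.sqrt(len(query_emb))
    logits = [sum(t * q for t, q in zip(tok, query_emb)) / scale
              for tok in token_embs]
    attn = softmax(logits)
    top = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    return sorted(top[:k_tokens])


def retrieve(frames, query_embs, k_frames, k_tokens):
    """frames is a list of (frame_embedding, token_embeddings) pairs.
    Returns {frame_index: [kept token indices]} for the pivot frames only."""
    pivot_ids = select_pivot_frames([f[0] for f in frames], query_embs, k_frames)
    # The first query embedding is treated as the original (unexpanded) query.
    return {i: select_pivot_tokens(frames[i][1], query_embs[0], k_tokens)
            for i in pivot_ids}
```

Calling `retrieve(frames, query_embs, k_frames, k_tokens)` on a list of per-frame embeddings returns only the token indices that survive both stages, so the number of visual tokens passed to the language model is bounded by `k_frames * k_tokens` regardless of video length. That bound is the point of the dual-granularity design in this sketch; the adaptive budgets implied by the method's name are not modeled here.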