🤖 AI Summary
To address GPU memory bottlenecks and high computational overhead in long-video understanding, this paper proposes a single-pass, efficient video understanding framework. The method introduces a fixed-capacity memory network architecture, incorporating a learnable neural sampler for adaptive keyframe compression, end-to-end differentiable token selection, and a lightweight video encoder to construct discriminative, fixed-size memory representations. By performing feature extraction and compression in a single forward pass over the video, the approach drastically reduces both memory consumption and computational cost. Evaluated on the Rest-ADL dataset, it achieves 18x–75x faster inference while maintaining competitive accuracy on retrieval and question-answering tasks. This work establishes a scalable, real-time paradigm for long-video understanding, enabling efficient processing without sacrificing semantic fidelity or downstream performance.
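The core idea — a single scan over the video that retains only a fixed-size set of the most discriminative tokens — can be sketched with a streaming top-k selection. This is a minimal illustration, not the paper's implementation: the `scores` here stand in for the outputs of Long-VMNet's learned neural sampler, and the function name and greedy heap-based selection are assumptions for the sketch.

```python
import heapq

def build_fixed_memory(tokens, scores, capacity):
    """Single-pass, fixed-capacity memory sketch.

    Streams over (token, score) pairs once and keeps only the
    `capacity` highest-scoring tokens, so memory use stays constant
    regardless of video length. `scores` is a hypothetical stand-in
    for a learned sampler's per-token discriminativeness estimates.
    """
    heap = []  # min-heap of (score, position, token); root = weakest kept token
    for pos, (tok, s) in enumerate(zip(tokens, scores)):
        if len(heap) < capacity:
            heapq.heappush(heap, (s, pos, tok))
        elif s > heap[0][0]:
            # Evict the current weakest token in O(log capacity)
            heapq.heapreplace(heap, (s, pos, tok))
    # Return the kept tokens in their original temporal order
    return [tok for _, pos, tok in sorted(heap, key=lambda item: item[1])]
```

Because each incoming token costs at most O(log capacity), the whole pass is linear in video length, which is what makes the single-scan regime tractable; the paper's differentiable token selection replaces this hard top-k with a trainable variant.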
📝 Abstract
Long-form video understanding is essential for applications such as video retrieval, summarization, and question answering. Yet traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present the Long-Video Memory Network, Long-VMNet, a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet needs only one scan through the video, greatly boosting efficiency. Our results on the Rest-ADL dataset demonstrate an 18x–75x improvement in inference time for long-form video retrieval and question answering, with competitive predictive performance.