🤖 AI Summary
To address GPU memory bottlenecks and high computational overhead in long-video understanding, this paper proposes a single-pass, efficient video understanding framework. The method introduces a fixed-capacity memory network architecture, incorporating a learnable neural sampler for adaptive keyframe compression, end-to-end differentiable token selection, and a lightweight video encoder to construct discriminative, fixed-size memory representations. By performing feature extraction and compression in a single forward pass over the video, the approach drastically reduces both memory consumption and computational cost. Evaluated on the Rest-ADL dataset, it achieves 18x–75x faster inference while maintaining competitive accuracy on retrieval and question-answering tasks. This work establishes a scalable, real-time paradigm for long-video understanding, enabling efficient processing without sacrificing semantic fidelity or downstream performance.
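The core idea — a single scan over the video that retains only a fixed-size set of the most discriminative tokens — can be sketched with a streaming top-k selection. This is a minimal illustration, not the paper's implementation: the `scores` here stand in for the outputs of Long-VMNet's learned neural sampler, and the function name and greedy heap-based selection are assumptions for the sketch.

```python
import heapq

def build_fixed_memory(tokens, scores, capacity):
    """Single-pass, fixed-capacity memory sketch.

    Streams over (token, score) pairs once and keeps only the
    `capacity` highest-scoring tokens, so memory use stays constant
    regardless of video length. `scores` is a hypothetical stand-in
    for a learned sampler's per-token discriminativeness estimates.
    """
    heap = []  # min-heap of (score, position, token); root = weakest kept token
    for pos, (tok, s) in enumerate(zip(tokens, scores)):
        if len(heap) < capacity:
            heapq.heappush(heap, (s, pos, tok))
        elif s > heap[0][0]:
            # Evict the current weakest token in O(log capacity)
            heapq.heapreplace(heap, (s, pos, tok))
    # Return the kept tokens in their original temporal order
    return [tok for _, pos, tok in sorted(heap, key=lambda item: item[1])]
```

Because each incoming token costs at most O(log capacity), the whole pass is linear in video length, which is what makes the single-scan regime tractable; the paper's differentiable token selection replaces this hard top-k with a trainable variant.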
📝 Abstract
Long-form video understanding is essential for applications such as video retrieval, summarization, and question answering. Yet traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present the Long-Video Memory Network, Long-VMNet, a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet needs only one scan through the video, greatly boosting efficiency. Our results on the Rest-ADL dataset demonstrate an 18x–75x improvement in inference time for long-form video retrieval and question answering, with competitive predictive performance.