AI Summary
Current vision-language models (VLMs) suffer from limited context length and rapid long-term memory decay, hindering effective understanding of ultra-long videos. To address this, we propose the first framework that formulates ultra-long video understanding as a sequential decision-making process. Our approach introduces four key innovations: (1) an adaptive memory caching mechanism that dynamically retains salient frames while compressing redundant information; (2) Progressive State Propagation (PSP), enabling cross-temporal state transfer through hierarchical temporal abstraction; (3) Temporal Cascading Reward (TCR), which alleviates reward sparsity by decomposing global objectives into temporally cascaded sub-rewards; and (4) policy optimization via the PRPO algorithm for stable and efficient reinforcement learning. Evaluated on multiple ultra-long video benchmarks, our method significantly outperforms leading open-source VLMs, achieving a 23.6% improvement in reasoning consistency, a 37% reduction in memory footprint, and a 29% decrease in computational overhead.
Abstract
Ultra-long video understanding remains an open challenge, as existing vision-language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-horizon tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules: Progressive State Propagation (PSP), which adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model's exploration space; and Temporal Cascading Reward (TCR), which alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
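To make the memory-management idea concrete, the following is a minimal sketch of a global memory buffer that keeps salient frame features and drops redundant or low-value ones. All names (`MemoryBuffer`, `update`, the cosine-similarity deduplication, and salience-based eviction) are illustrative assumptions for exposition; the paper's actual buffer design and scoring are not specified here.

```python
# Hypothetical sketch of an adaptive memory buffer (not the paper's
# actual implementation): deduplicate near-identical frames by cosine
# similarity, and evict the least salient frame when over capacity.
import numpy as np

class MemoryBuffer:
    def __init__(self, capacity=8, sim_threshold=0.95):
        self.capacity = capacity          # max frames retained
        self.sim_threshold = sim_threshold  # redundancy cutoff
        self.frames = []                  # list of (unit feature, salience)

    def update(self, feature, salience):
        # Normalize so the dot product below is cosine similarity.
        feature = feature / (np.linalg.norm(feature) + 1e-8)
        # Discard frames nearly identical to something already stored.
        for stored, _ in self.frames:
            if float(stored @ feature) > self.sim_threshold:
                return
        self.frames.append((feature, salience))
        # Over capacity: keep the most salient frames, drop the weakest.
        if len(self.frames) > self.capacity:
            self.frames.sort(key=lambda fs: fs[1], reverse=True)
            self.frames.pop()
```

For example, with `capacity=2`, streaming in a duplicate frame leaves the buffer unchanged, while a fourth distinct frame with the lowest salience is evicted immediately, so memory stays bounded regardless of video length.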