AI Summary
This work addresses the challenge of maintaining fine-grained spatiotemporal consistency in long-duration video generation by introducing DecMem, a decoupled memory architecture that efficiently retrieves historical information through sparse global memory while stabilizing local details via anchored local memory. By decoupling global and local memory mechanisms, DecMem overcomes the limitations of conventional learnable memory approaches, which suffer from high computational overhead and attention dispersion during long-sequence inference. The proposed method significantly enhances both generation efficiency and temporal coherence. Experimental results demonstrate that DecMem outperforms state-of-the-art methods on minute-scale high-fidelity video synthesis, enabling controllable, high-quality, and spatiotemporally consistent long-form video generation.
Abstract
Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.
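The abstract does not say how attention dispersion is measured or how Sparse Global Memory selects its entries. The toy sketch below is therefore a generic illustration only, not DecMem's mechanism. It assumes that dispersion means the softmax weight on a relevant memory entry is diluted as the memory grows, and that sparsity means keeping only the top-k scoring entries. The function names and score values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_weight_on_target(num_distractors, target_score=4.0, distractor_score=0.0):
    """Weight the single relevant entry receives under dense softmax attention.

    Assumption: one relevant memory entry with a high score and
    `num_distractors` irrelevant entries that all share a lower score.
    """
    scores = [target_score] + [distractor_score] * num_distractors
    return softmax(scores)[0]


def topk_weight_on_target(num_distractors, k, target_score=4.0, distractor_score=0.0):
    """Same setup, but softmax is applied only to the k highest-scoring entries.

    Assumption: top-k selection stands in for a sparse memory lookup.
    """
    scores = [target_score] + [distractor_score] * num_distractors
    kept = sorted(scores, reverse=True)[:k]  # the relevant entry sorts first
    return softmax(kept)[0]


# Compare how the relevant entry's weight changes as the memory grows.
for n in (10, 1_000, 100_000):
    dense = dense_weight_on_target(n)
    sparse = topk_weight_on_target(n, k=8)
    print(f"distractors={n:>7}  dense={dense:.6f}  top-8={sparse:.6f}")
```

Under these assumptions, the dense weight on the relevant entry shrinks as distractors accumulate, while the top-k weight depends only on k and stays constant. This is one plausible reading of why a long history can hurt a naïve learnable memory.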