🤖 AI Summary
This work addresses the degradation of visual quality, identity drift, and motion stagnation in long-form video generation caused by the loss of historical context in sliding-window caching strategies. The authors propose a training-free autoregressive framework capable of generating videos of unlimited duration under a fixed memory budget. By compressing historical information into memory tokens via exponential moving averages, the method preserves long-term consistency, while decoupled online rotary position embedding (RoPE) indexing keeps the aggregated cache free of conflicting positional phases. This approach substantially enhances temporal coherence, visual fidelity, and subject consistency in videos spanning minutes to hours, outperforming existing methods without requiring additional training.
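The dual-stream EMA compression described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, decay values, and key dimension are assumptions.

```python
import numpy as np

def ema_update(memory, key, decay):
    # Fold a new key vector into a fixed-size memory token.
    # Hypothetical helper; decay controls how slowly the stream forgets.
    return decay * memory + (1.0 - decay) * key

rng = np.random.default_rng(0)
d = 64                                   # per-head key dimension (assumed)
long_term = np.zeros(d)                  # slow stream: preserves global identity
short_term = np.zeros(d)                 # fast stream: tracks recent dynamics

for _ in range(1000):                    # stream of past keys, arbitrary length
    k = rng.standard_normal(d)
    long_term = ema_update(long_term, k, decay=0.999)
    short_term = ema_update(short_term, k, decay=0.9)
```

Note that the cache footprint stays at two `d`-vectors per head no matter how long the video grows, which is what makes the fixed memory budget possible.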
📝 Abstract
Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
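The online RoPE indexing idea, caching unrotated keys and applying the rotation only at attention time, can be sketched as below. The function name and the "half-split" rotation layout are illustrative assumptions; production implementations often use an interleaved-pair layout instead.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Rotate feature pairs of x by position-dependent angles (half-split layout).
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

# Keys are cached UNROTATED; the positional rotation is applied only when
# attention is computed, so an EMA-aggregated key carries no stale phase.
cached_key = np.random.default_rng(1).standard_normal(64)
k_at_t = rope_rotate(cached_key, pos=42)        # position assigned at attention time
```

Because each feature pair is rotated, the key's norm is unchanged; only its positional phase is set, and it can be set freshly at every decoding step.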