🤖 AI Summary
This work addresses the degradation of visual quality, identity drift, and motion stagnation in long-form video generation caused by the loss of historical context in sliding-window caching strategies. The authors propose a training-free autoregressive framework capable of generating videos of unlimited duration under a fixed memory budget. By compressing historical information into memory tokens via exponential moving averages, the method preserves long-term consistency, while decoupled online rotary position embedding (RoPE) indexing keeps the aggregated cache free of conflicting positional phases. This approach substantially enhances temporal coherence, visual fidelity, and subject consistency in videos spanning minutes to hours, outperforming existing methods without requiring additional training.
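The dual-stream EMA compression described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, decay values, and key dimension are assumptions.

```python
import numpy as np

def ema_update(memory, key, decay):
    # Fold a new key vector into a fixed-size memory token.
    # Hypothetical helper; decay controls how slowly the stream forgets.
    return decay * memory + (1.0 - decay) * key

rng = np.random.default_rng(0)
d = 64                                   # per-head key dimension (assumed)
long_term = np.zeros(d)                  # slow stream: preserves global identity
short_term = np.zeros(d)                 # fast stream: tracks recent dynamics

for _ in range(1000):                    # stream of past keys, arbitrary length
    k = rng.standard_normal(d)
    long_term = ema_update(long_term, k, decay=0.999)
    short_term = ema_update(short_term, k, decay=0.9)
```

Note that the cache footprint stays at two `d`-vectors per head no matter how long the video grows, which is what makes the fixed memory budget possible.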
📝 Abstract
Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
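The online RoPE indexing idea, caching unrotated keys and applying the rotation only at attention time, can be sketched as below. The function name and the "half-split" rotation layout are illustrative assumptions; production implementations often use an interleaved-pair layout instead.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Rotate feature pairs of x by position-dependent angles (half-split layout).
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

# Keys are cached UNROTATED; the positional rotation is applied only when
# attention is computed, so an EMA-aggregated key carries no stale phase.
cached_key = np.random.default_rng(1).standard_normal(64)
k_at_t = rope_rotate(cached_key, pos=42)        # position assigned at attention time
```

Because each feature pair is rotated, the key's norm is unchanged; only its positional phase is set, and it can be set freshly at every decoding step.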