Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

📅 2026-06-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video generation methods rely on fixed memory windows and heuristic compression strategies, which limits high-quality, temporally unbounded video synthesis. This work proposes an autoregressive framework built on video Diffusion Transformers (DiTs) with learnable memory queries that filter and compress history of arbitrary length at constant cost. A unified relative RoPE positional encoding scheme removes the temporal length limit inherited from the pretrained model. The memory queries are optimized end-to-end with the DiT, and compute stays constant regardless of video length. The authors report state-of-the-art performance on both long and short video generation and, to their knowledge, the first real-time rollouts of 24 hours (more than 1.3 million frames), suggesting a practical path toward infinite video generation.
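The memory query idea can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: it assumes a single-head attention read over the evicted frame tokens followed by one scalar sigmoid gate per query, and the names `update_memory` and `gate_w` are hypothetical. The point it shows is that the memory keeps a fixed size, so the per-step cost does not depend on how many frames have been generated.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update_memory(memory, evicted, gate_w):
    """Fold evicted frame tokens into a fixed-size memory (illustrative only).

    memory:  list of M query vectors, each of length D
    evicted: list of T token vectors leaving the local window
    gate_w:  hypothetical gate weights of length 2*D (learned in a real model)
    """
    d = len(memory[0])
    new_memory = []
    for q in memory:
        # Attention read: the query attends over the evicted tokens.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in evicted])
        read = [sum(w * k[j] for w, k in zip(weights, evicted)) for j in range(d)]
        # Scalar gate: how much of the old state to overwrite with the read.
        g = 1.0 / (1.0 + math.exp(-dot(gate_w, q + read)))
        new_memory.append([(1.0 - g) * qi + g * ri for qi, ri in zip(q, read)])
    return new_memory

random.seed(0)
M, D = 8, 16
memory = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]
gate_w = [random.gauss(0, 0.1) for _ in range(2 * D)]

# Simulate 100 eviction events; the memory never grows.
for _ in range(100):
    evicted = [[random.gauss(0, 1) for _ in range(D)] for _ in range(4)]
    memory = update_memory(memory, evicted, gate_w)

print(len(memory), len(memory[0]))  # 8 16
```

In a trained model the queries and gate weights would be optimized jointly with the DiT; here they are random and serve only to show the data flow.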
📝 Abstract
We present Echo-Infinity, an autoregressive (AR) framework towards real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors because their cache window is limited and they ignore autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce a Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance and, to our knowledge, demonstrates promising 24-hour (>1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.
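One possible reading of the RoPE recipe is sketched below. The abstract states only that sink frames start at id 0 and that the newest frame id is capped at the pretrained maximum; the contiguous window layout, the helper name `rope_ids`, and all parameter values are assumptions made for illustration. Ids assigned to memory queries are omitted because the abstract does not specify them.

```python
def rope_ids(newest_frame, num_sink, window, max_id):
    """Assign temporal RoPE ids for one AR step (hypothetical layout).

    newest_frame: absolute index of the newest frame (grows without bound)
    num_sink:     number of sink frames, anchored at ids 0..num_sink-1
    window:       number of recent frames kept in the local window
    max_id:       pretrained maximum temporal RoPE id (assumed value)
    """
    sink_ids = list(range(num_sink))
    # The newest frame's id tracks its absolute index until it hits the cap.
    newest_id = min(newest_frame, max_id)
    # Window ids end at newest_id and never overlap the sink ids.
    first = max(num_sink, newest_id - window + 1)
    window_ids = list(range(first, newest_id + 1))
    return sink_ids, window_ids

# Early in the rollout, ids match absolute frame indices.
print(rope_ids(5, num_sink=2, window=4, max_id=20))
# ([0, 1], [2, 3, 4, 5])

# Deep into a long rollout, ids stay inside the pretrained range.
print(rope_ids(1_300_000, num_sink=2, window=4, max_id=20))
# ([0, 1], [17, 18, 19, 20])
```

Under this layout the model never sees an id above `max_id`, in training or at inference, which is the property the abstract credits with closing the train-test extrapolation gap.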
Problem

Research questions and friction points this paper is trying to address.

infinite video generation
autoregressive modeling
memory compression
temporal RoPE extrapolation
real-time generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

evolving memory
autoregressive video generation
learnable memory query
Unified Relative RoPE
infinite video generation