AI Summary
Autoregressive video diffusion models often suffer from visual artifacts, quality degradation, and temporal drift when generating minute-long videos. The two main causes are a limited KV-cache budget, which prevents the model from retaining its full history, and a context distribution shift that accumulates as the model conditions on its own generated frames. This work proposes TetherCache, a training-free, plug-and-play cache management strategy that combines two mechanisms: GRAB (Gated Recall with Attention-Diversity Balancing) and TAME (Trusted Alignment via Memory Editing). Operating within a fixed cache budget, TetherCache preserves diverse yet relevant historical information while limiting contamination from drifted features. Built on the Self-Forcing framework, it selects long-range memory frames with a gated score that balances attention-based relevance against temporal diversity, and lightly aligns the statistics of newly recalled memory tokens to a trusted context distribution. On VBench-Long, the method improves generation quality at 30, 60, and 240 seconds; at 240 seconds it substantially improves overall and semantic scores and reduces quality drift from 7.84 to 1.33.
Abstract
Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
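The abstract describes the cache layout and the two mechanisms only in words, so the following is a minimal sketch of how such a scheme could look, not the authors' implementation. Everything specific here is an assumption made for illustration: the function names, the greedy selection loop, the linear mix of relevance and temporal distance (`diversity_weight`), the use of scalar mean and standard deviation as the "statistics", and the blending factor `strength`. Real KV-cache tokens would be tensors rather than Python lists.

```python
from statistics import fmean, pstdev


def select_memory_frames(relevance, budget, diversity_weight=0.5):
    """GRAB-like sketch (assumed form): greedily pick `budget` frame indices,
    scoring each candidate by normalized relevance plus its normalized
    distance to the nearest frame already chosen."""
    n = len(relevance)
    if budget >= n:
        return list(range(n))
    max_rel = max(relevance) or 1.0  # avoid dividing by zero
    span = max(n - 1, 1)
    chosen, candidates = [], set(range(n))
    while len(chosen) < budget:
        best, best_score = None, float("-inf")
        for i in sorted(candidates):
            rel = relevance[i] / max_rel
            div = min((abs(i - j) for j in chosen), default=span) / span
            score = (1 - diversity_weight) * rel + diversity_weight * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        candidates.remove(best)
    return sorted(chosen)


def align_to_trusted(tokens, trusted, strength=0.3):
    """TAME-like sketch (assumed form): move the mean and standard deviation
    of `tokens` toward those of `trusted`, blended by `strength` in [0, 1]."""
    mu_t, sd_t = fmean(tokens), pstdev(tokens)
    mu_r, sd_r = fmean(trusted), pstdev(trusted)
    scale = sd_r / sd_t if sd_t > 0 else 1.0
    aligned = [(x - mu_t) * scale + mu_r for x in tokens]
    return [(1 - strength) * x + strength * a for x, a in zip(tokens, aligned)]


def build_cache(num_frames, relevance, sink=1, recent=3, memory=2):
    """Split frame indices into sink, memory, and recent regions.
    `relevance` holds one score per frame between the sink and recent regions."""
    middle_start, middle_end = sink, num_frames - recent
    if len(relevance) != middle_end - middle_start:
        raise ValueError("need one relevance score per evictable frame")
    picked = select_memory_frames(relevance, memory)
    return {
        "sink": list(range(sink)),
        "memory": [middle_start + i for i in picked],
        "recent": list(range(middle_end, num_frames)),
    }


if __name__ == "__main__":
    # Hypothetical relevance scores for the six frames between sink and recent.
    print(build_cache(10, [0.9, 0.8, 0.1, 0.1, 0.7, 0.2]))
```

In this sketch, setting `diversity_weight=0` reduces selection to plain top-k by relevance, which tends to pick clusters of adjacent frames; raising it spreads the picks across the history. A small `strength` keeps the edit light, in the spirit of the "lightly edits" wording above.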