🤖 AI Summary
Existing autoregressive video generation methods employ static key-value (KV) caching strategies that disregard the varying importance of tokens, leading to the loss of critical spatiotemporal information and the accumulation of redundancy, which ultimately limits both generation quality and efficiency. To address this, the authors propose PaFu-KV, a dynamic KV caching strategy that leverages both past and future context. It introduces a lightweight saliency estimation head, enabled by a novel bidirectional teacher distillation mechanism, to accurately assess the spatiotemporal significance of each token, dynamically retaining high-information tokens while discarding redundant ones. Experiments demonstrate that PaFu-KV achieves high-quality long-form video generation across multiple benchmarks while significantly reducing memory consumption and accelerating inference.
📝 Abstract
Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly enhanced the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV Cache policies, which ignore differences in token importance in long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, invalid cache entries, thereby degrading video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head, distilled from a bidirectional teacher, to estimate salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while enabling accelerated inference, thereby supporting more efficient long-horizon video generation. Our code will be released upon paper acceptance.
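The core cache operation the abstract describes, keeping the most salient cached tokens under a fixed budget, can be sketched as follows. This is an illustrative toy, not the authors' released implementation: the salience scores here are supplied directly, standing in for the output of the paper's learned Salience Estimation Head, and the function name `prune_kv_cache` and its signature are assumptions for illustration.

```python
import numpy as np

def prune_kv_cache(keys, values, salience, budget):
    """Retain only the `budget` most salient tokens' KV entries.

    keys, values: (T, d) arrays of cached per-token keys and values.
    salience: (T,) per-token salience scores. In PaFu-KV these would
        come from the learned Salience Estimation Head; here they are
        passed in directly as a stand-in.
    budget: number of cache slots to keep.
    Returns the pruned keys/values and the kept token indices.
    """
    keep = np.argsort(salience)[-budget:]  # indices of top-`budget` scores
    keep.sort()  # restore temporal order among retained tokens
    return keys[keep], values[keep], keep

# Toy example: 6 cached tokens with embedding dim 4, budget of 3.
rng = np.random.default_rng(0)
keys = rng.standard_normal((6, 4))
values = rng.standard_normal((6, 4))
salience = np.array([0.9, 0.1, 0.5, 0.05, 0.8, 0.2])
k2, v2, kept = prune_kv_cache(keys, values, salience, budget=3)
print(kept)  # → [0 2 4]: tokens 0, 2, 4 have the highest salience
```

A real decoder would re-run this pruning periodically as the cache grows, so memory stays bounded while the most informative spatiotemporal context is preserved.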