🤖 AI Summary
This work addresses the escalating computational cost of autoregressive long-video generation, which stems from an ever-growing historical key-value (KV) cache. Existing methods either truncate the cache or compress it into implicit memory, and both sacrifice explicit access to query-relevant historical details. The authors propose OmniMem, a framework that performs explicit, full-range sparse KV retrieval within chunk-based generation. Adaptive Window Exclusion alleviates local bias in sparse KV selection, while Query-Shared KV Selection and Per-Head Scattered KV Access avoid Union Explosion in memory access and let each attention head retrieve non-contiguous historical KV blocks on demand. Experiments show that OmniMem improves Dynamic Degree by 52.3% over strong baselines while preserving strong consistency and maintaining comparable memory usage.
📝 Abstract
Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% over strong baselines while preserving strong consistency and maintaining comparable memory usage.
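The three mechanisms named in the abstract can be illustrated with a minimal single-head sketch. It is not the authors' implementation. Every name in it (`select_blocks`, `gather_blocks`, `attend`, `budget`, `local_window`, `block_size`) is hypothetical, and so are the choices to pool queries by averaging, to score blocks by a dot product with one pooled key per block, and to treat "sufficient long-range history" as "at least `budget` blocks outside the local window". The abstract specifies none of these details.

```python
import math


def select_blocks(queries, block_keys, budget, local_window):
    """Choose historical block indices for ONE attention head.

    queries:      query vectors of the current chunk
    block_keys:   one pooled key vector per historical block, oldest first
    budget:       number of blocks the sparse selection may keep (assumed)
    local_window: number of most recent blocks treated as local (assumed)
    """
    dim = len(queries[0])
    # Query-shared selection (assumed form): average the chunk's queries so
    # that all of them share a single block selection.
    shared_q = [sum(q[d] for q in queries) / len(queries) for d in range(dim)]
    scores = [sum(a * b for a, b in zip(shared_q, key)) for key in block_keys]

    # Adaptive window exclusion (assumed criterion): drop the local-window
    # blocks from the candidates only when the long-range history alone can
    # fill the budget.
    n_long = len(block_keys) - local_window
    candidates = range(n_long) if n_long >= budget else range(len(block_keys))

    top = sorted(candidates, key=lambda b: scores[b], reverse=True)[:budget]
    return sorted(top)


def gather_blocks(cache, block_ids, block_size):
    """Per-head scattered access: read only this head's selected blocks from
    its own cache, without building a union buffer shared across heads."""
    rows = []
    for b in block_ids:
        rows.extend(cache[b * block_size:(b + 1) * block_size])
    return rows


def attend(query, keys, values):
    """Plain softmax attention of one query over the gathered keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(a * b for a, b in zip(query, k)) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]
```

In a multi-head setting, `select_blocks` would run once per head, and each head would call `gather_blocks` on its own key and value caches with its own indices. The selections can therefore differ across heads without being merged. The sketch leaves out attention over the local window itself, which would presumably be handled separately.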