OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the rising computational cost of autoregressive long-video generation, which stems from an ever-growing historical key-value (KV) cache. Existing methods truncate the cache or compress it into implicit memory, and both lose explicit access to query-relevant historical details. The authors propose OmniMem, a framework for explicit, full-range sparse KV retrieval in chunk-based generation. Adaptive Window Exclusion counters local bias in sparse selection by removing local-window blocks from the candidates when sufficient long-range history is available. Query-Shared KV Selection and Per-Head Scattered KV Access together avoid Union Explosion in memory access, so each attention head can retrieve non-contiguous historical KV blocks according to its own selection pattern. In long-video generation experiments, OmniMem improves Dynamic Degree by 52.3% over strong baselines while preserving strong consistency and keeping memory usage comparable.

📝 Abstract

Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% and preserves strong consistency over strong baselines, while maintaining comparable memory usage.
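
The selection logic described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each historical block is summarized by one pooled key vector, that all queries in the current chunk are mean-pooled into a single vector per head (one possible reading of Query-Shared KV Selection), that blocks are scored by a dot product, and that the local window is attended separately from the sparse budget. The names `select_blocks`, `union_size`, `local_window`, and `budget` are hypothetical.

```python
from typing import List


def select_blocks(
    queries: List[List[float]],     # query vectors of one head in the current chunk
    block_keys: List[List[float]],  # one pooled key per historical block, oldest first
    local_window: int,              # number of most recent blocks (assumed always attended)
    budget: int,                    # sparse budget: blocks this head may retrieve
) -> List[int]:
    """Return sorted indices of the historical blocks one head retrieves."""
    n = len(block_keys)
    dim = len(block_keys[0])

    # Assumed form of query-shared selection: pool the chunk's queries so
    # every query of this head uses the same block selection.
    pooled = [sum(q[d] for q in queries) / len(queries) for d in range(dim)]
    scores = [sum(p * k for p, k in zip(pooled, key)) for key in block_keys]

    # Assumed form of adaptive window exclusion: drop local-window blocks
    # from the candidates only when long-range history can fill the budget.
    long_range = list(range(max(0, n - local_window)))
    candidates = long_range if len(long_range) >= budget else list(range(n))

    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)


def union_size(per_head_selection: List[List[int]]) -> int:
    """Blocks a shared buffer would need to cover every head's selection."""
    return len(set().union(*per_head_selection))


# Toy example: 6 historical blocks, 2 heads with different preferences.
keys = [[1, 0], [0, 1], [2, 0], [0, 2], [3, 0], [0, 3]]
head_a = select_blocks([[1, 0], [1, 0]], keys, local_window=2, budget=2)
head_b = select_blocks([[0, 1], [0, 1]], keys, local_window=2, budget=2)
print(head_a, head_b)                # [0, 2] [1, 3]
print(union_size([head_a, head_b]))  # 4
```

In this toy case each head retrieves only 2 blocks, but a buffer covering the union of both selections would hold 4. Under these assumptions, that gap is what per-head access avoids: each head reads its own non-contiguous blocks directly by index instead of from an expanded shared buffer. Note also that without the exclusion step, head A's top two blocks would be 4 and 2, so half of its budget would go to a block that already lies inside the local window.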

Problem

Research questions and friction points this paper is trying to address.

long video generation

memory retrieval

KV cache

autoregressive modeling

historical context

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse KV retrieval

adaptive window exclusion

query-shared KV selection