🤖 AI Summary
This work addresses the linear memory growth of KV caching during long-context inference and the loss of generation quality caused by token eviction strategies that ignore directional consistency between retained and evicted tokens. The study identifies this previously overlooked "directional mismatch" issue and introduces a geometrically regularized eviction mechanism grounded in low-dimensional moment statistics: token counts, key and value means, and value-key covariances. To mitigate information loss from eviction, the authors further devise a first-order attention approximation that enables closed-form correction, so that eviction and compensation reinforce each other. Evaluated on the LongBench and RULER benchmarks with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, the proposed method outperforms all baselines at every cache budget, with the largest gains under high compression ratios.
📝 Abstract
Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction addresses this by retaining a fixed-size subset of key-value pairs and discarding the rest. We identify that a primary source of output degradation is not the residual attention mass on evicted tokens, which existing methods already minimize, but a directional mismatch between the retained and evicted token sets. Specifically, the evicted tokens are in practice often near-orthogonal to the retained ones. Thus, even a small evicted mass can have an outsized impact on the resulting direction distribution and be amplified into substantial output error. This reveals a fundamental limit of existing strategies. To address this, we propose MomentKV, which maintains compact moment statistics over the evicted token set: a count, a key mean, a value mean, and a value-key covariance. During eviction, these statistics are used to identify tokens that are already well aligned with, and captured by, the accumulated summary, keeping the evicted set geometrically regular. During inference, the same statistics yield a closed-form first-order approximation of the evicted attention output, forming a mutually reinforcing loop between selective eviction and accurate correction. On LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV outperforms all baselines at every cache budget, with the largest gains under aggressive compression.
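
The abstract does not spell out the correction formula, so the following is only a minimal sketch of how a count, key mean, value mean, and value-key covariance could yield a first-order estimate of the evicted attention terms. It assumes the approximation comes from linearizing exp(q·k) around the evicted key mean, which gives an evicted softmax denominator of roughly n·exp(q·μ_k) and a numerator of roughly n·exp(q·μ_k)·(μ_v + C·q). All function names are hypothetical, the query is assumed to be pre-scaled, and nothing here should be read as the authors' implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def evicted_moments(keys, values):
    """Count, key mean, value mean, and value-key covariance of a token set."""
    n = len(keys)
    d_k, d_v = len(keys[0]), len(values[0])
    mu_k = [sum(k[j] for k in keys) / n for j in range(d_k)]
    mu_v = [sum(v[j] for v in values) / n for j in range(d_v)]
    # cov[a][b] = mean over tokens of (v[a] - mu_v[a]) * (k[b] - mu_k[b])
    cov = [[sum((v[a] - mu_v[a]) * (k[b] - mu_k[b])
                for k, v in zip(keys, values)) / n
            for b in range(d_k)] for a in range(d_v)]
    return n, mu_k, mu_v, cov

def approx_evicted_terms(q, moments):
    """Assumed first-order estimate of the evicted denominator and numerator."""
    n, mu_k, mu_v, cov = moments
    z = n * math.exp(dot(q, mu_k))
    num = [z * (mu_v[a] + dot(cov[a], q)) for a in range(len(mu_v))]
    return z, num

def exact_terms(q, keys, values):
    """Exact unnormalized softmax denominator and numerator for a token set."""
    weights = [math.exp(dot(q, k)) for k in keys]
    z = sum(weights)
    num = [sum(w * v[a] for w, v in zip(weights, values))
           for a in range(len(values[0]))]
    return z, num

def attention(q, kept_keys, kept_values, extra=None):
    """Attention over kept tokens, optionally adding evicted (z, num) terms."""
    z, num = exact_terms(q, kept_keys, kept_values)
    if extra is not None:
        z += extra[0]
        num = [x + y for x, y in zip(num, extra[1])]
    return [x / z for x in num]

# Toy data (illustrative only): one kept token, three tightly clustered evicted tokens.
q = [0.5, 0.5]
kept_k, kept_v = [[0.0, 1.0]], [[2.0, 2.0]]
ev_k = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1]]
ev_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

full = attention(q, kept_k + ev_k, kept_v + ev_v)
uncorrected = attention(q, kept_k, kept_v)
corrected = attention(q, kept_k, kept_v,
                      approx_evicted_terms(q, evicted_moments(ev_k, ev_v)))
```

In this toy setup the corrected output tracks full attention far more closely than plain eviction does, because the evicted keys sit near their mean. That is consistent with the abstract's emphasis on keeping the evicted set geometrically regular: under this assumed linearization, the estimate degrades as evicted keys spread away from the key mean.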