Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution

📅 2025-10-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In large language model (LLM) inference, the memory overhead of the Key-Value (KV) cache severely limits deployment efficiency. Existing attention-score-based pruning methods cannot observe attention scores from future tokens, and they are incompatible with modern attention kernels such as Flash Attention that never materialize the full attention matrix. This paper introduces Expected Attention: a training-free, online KV cache compression method that works in both the prefill and decoding phases. Its core contribution is a closed-form estimate of the expected attention score between future queries and each cached KV pair, derived solely from the distributional statistics of intermediate activations, without access to past or future attention matrices. Using these scores to rank and prune KV pairs yields a Flash Attention-compatible online compression mechanism that achieves substantial memory reduction with minimal performance degradation. The authors also release KVPress, an open-source evaluation library covering more than 20 compression techniques, and demonstrate consistent improvements over state-of-the-art baselines.
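The closed form itself is not reproduced in this summary, but a standard way to obtain such an estimate is to model future queries as Gaussian, q ~ N(μ, Σ), estimated from observed activations; the Gaussian moment-generating function then gives E[exp(qᵀk/√d)] = exp(μᵀk/√d + kᵀΣk/(2d)) for each key k. The sketch below illustrates that idea only; function names and shapes are hypothetical, not the KVPress API:

```python
import numpy as np

def expected_attention_scores(keys, mu, sigma):
    """Estimate E[exp(q @ k / sqrt(d))] for each cached key k, assuming
    future queries q ~ N(mu, sigma). Uses the Gaussian moment-generating
    function E[exp(t @ q)] = exp(t @ mu + 0.5 * t @ sigma @ t)
    with t = k / sqrt(d). Illustrative sketch, not the paper's exact method.
    """
    d = keys.shape[-1]
    t = keys / np.sqrt(d)                           # (n_keys, d)
    mean_term = t @ mu                              # (n_keys,)
    var_term = 0.5 * np.einsum("nd,de,ne->n", t, sigma, t)
    return np.exp(mean_term + var_term)             # proportional to expected weight

# Toy usage: 4 cached keys with head dimension 8
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
mu = rng.normal(size=8)        # running mean of observed queries (assumed)
sigma = 0.1 * np.eye(8)        # running covariance estimate (assumed)
scores = expected_attention_scores(keys, mu, sigma)
```

Because softmax normalization is shared across all keys for a given query, the unnormalized expectation is sufficient for ranking KV pairs against each other.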

📝 Abstract
Memory consumption of the Key-Value (KV) cache represents a major bottleneck for efficient large language model inference. While attention-score-based KV cache pruning shows promise, it faces critical practical limitations: attention scores from future tokens are unavailable during compression, and modern implementations like Flash Attention do not materialize the full attention matrix, making past scores inaccessible. To overcome these challenges, we introduce **Expected Attention, a training-free compression method** that estimates KV pair importance by predicting how future queries will attend to them. Our approach leverages the distributional properties of LLM activations to compute expected attention scores in closed form for each KV pair. These scores enable principled ranking and pruning of KV pairs with minimal impact on the residual stream, achieving effective compression without performance degradation. Importantly, our method operates seamlessly across both prefilling and decoding phases, consistently outperforming state-of-the-art baselines in both scenarios. Finally, **we release KVPress, a comprehensive library to enable researchers to implement and benchmark KV cache compression methods, already including more than 20 techniques**.
Problem

Research questions and friction points this paper is trying to address.

Compress KV cache to reduce memory in LLM inference
Estimate KV importance by predicting future query attention
Enable training-free compression without performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates KV importance using future query distributions
Computes expected attention scores in closed form
Enables seamless compression across prefilling and decoding phases
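Once each KV pair has an estimated importance score, pruning reduces to keeping the top-scoring entries while preserving their original order in the cache. A minimal sketch of that ranking-and-pruning step (hypothetical helper, not the KVPress implementation):

```python
import numpy as np

def prune_kv_cache(keys, values, scores, keep_ratio=0.5):
    """Keep the fraction `keep_ratio` of KV pairs with the highest
    estimated attention scores. Kept pairs stay in their original
    order so positional structure survives. Illustrative only."""
    n_keep = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-n_keep:])  # top-n_keep, original order
    return keys[keep], values[keep]

# Toy usage: 6 cached KV pairs with head dimension 2
keys = np.arange(12.0).reshape(6, 2)
values = keys * 10
scores = np.array([0.9, 0.1, 0.5, 0.8, 0.2, 0.7])
kept_keys, kept_values = prune_kv_cache(keys, values, scores, keep_ratio=0.5)
# keeps the pairs at original positions 0, 3, and 5
```

In an online setting the same selection would be reapplied as the cache grows, with the query statistics updated from newly observed activations.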