Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the redundancy in standard Transformer key-value (KV) caches, where queries, keys, and values share the same dimensionality, leading to excessive memory consumption—particularly in long-context scenarios. The authors uncover an inherent asymmetry in attention mechanisms: the “selection” process governed by query-key interactions requires far fewer dimensions than the “value transmission” pathway. Leveraging this insight, they propose a plug-and-play KV cache compression method based on truncated singular value decomposition (SVD) of the key projection matrix, which decouples low-dimensional selection information from high-dimensional value representations. Without modifying architecture or requiring full retraining, only minimal fine-tuning of query-key projections enables substantial cache compression. Compatible with techniques like grouped-query attention and quantization, the approach achieves 75% key cache reduction (with ~2% performance degradation) on GPT-2 and Mistral-7B, saving 25 GB per user at 128K context length and supporting approximately 60% more concurrent users.
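The headline memory numbers follow from simple cache arithmetic. The sketch below reproduces them under an assumed 7B-style configuration (32 layers, d_model = 4096, full multi-head attention, fp16 caches); the paper does not state these exact dimensions, so they are illustrative assumptions chosen to match the reported figures.

```python
# Back-of-envelope KV-cache arithmetic behind the reported savings.
# Assumed (hypothetical) 7B-style config: 32 layers, d_model = 4096,
# multi-head attention without GQA, fp16 (2 bytes per element).
layers, d_model, bytes_per_elem = 32, 4096, 2
context = 128 * 1024  # 128K tokens

# Per-user cache sizes in GiB.
key_cache_gib = layers * d_model * bytes_per_elem * context / 2**30  # 32 GiB
kv_cache_gib = 2 * key_cache_gib                                     # 64 GiB (K + V)

# A 75% key-cache reduction (r = d/4) frees three quarters of the key cache.
saved_gib = 0.75 * key_cache_gib        # 24 GiB, i.e. the paper's "~25 GB"
new_kv_gib = kv_cache_gib - saved_gib   # 40 GiB per user

# Fixed memory budget: users scale inversely with per-user cache size.
extra_users = kv_cache_gib / new_kv_gib - 1  # ~0.60, i.e. ~60% more users
print(f"key cache: {key_cache_gib:.0f} GiB, saved: {saved_gib:.0f} GiB, "
      f"+{extra_users:.0%} concurrent users")
```

Under these assumed dimensions the arithmetic lands on the paper's 25 GB and ~60% figures; different layer counts or GQA settings would shift the absolute numbers but not the ratios.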

📝 Abstract
Standard transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) -- far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch -- unlike GQA and MLA, which must be designed into the architecture before pretraining. We factorize each key projection $W_K \approx A_{d \times r} B_{r \times d}$ via truncated SVD (where $r = d_{\text{select}}$), set $W_K'= A$ as the new key projection producing compact $r$-dimensional keys for the cache, and absorb $B^\top$ into the query projection ($W_Q'= W_Q B^\top$) at zero cost -- since queries are never cached. At 7B scale, training from scratch with $r = d_{\text{model}}/4$ matches full-attention perplexity (9.2 vs 9.3 PPL after 20B tokens) while using 12% fewer parameters and training 8% faster. For existing models, SVD + QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at approximately 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to $16\times$ combined key cache compression. For a 7B model serving 128K context, factored keys save 25 GB of KV cache per user, enabling approximately 60% more concurrent users on identical hardware.
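The factorization step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: dimensions are toy-sized, the row-vector convention ($x W$) is assumed to match the abstract's shapes, and the subsequent QK fine-tuning step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 16  # model dim and selection rank (illustrative sizes)

W_Q = rng.standard_normal((d, d)) / np.sqrt(d)
W_K = rng.standard_normal((d, d)) / np.sqrt(d)

# Truncated SVD of the key projection: W_K ~ A @ B with A (d x r), B (r x d).
U, S, Vt = np.linalg.svd(W_K)
A = U[:, :r] * S[:r]  # d x r, singular values folded into A
B = Vt[:r, :]         # r x d

# New projections (row-vector convention, x @ W):
W_K_new = A           # compact r-dimensional keys go in the cache
W_Q_new = W_Q @ B.T   # B^T absorbed into the (uncached) query projection

x_q = rng.standard_normal((1, d))  # one query token
x_k = rng.standard_normal((5, d))  # five cached key tokens

# Attention logits with the rank-r key projection, computed two ways:
scores_lowrank = (x_q @ W_Q) @ (x_k @ (A @ B)).T      # full-width d-dim keys
scores_factored = (x_q @ W_Q_new) @ (x_k @ W_K_new).T  # r-dim cached keys

# The factored form reproduces the rank-r logits exactly -- the only
# approximation error comes from truncating W_K itself.
assert np.allclose(scores_lowrank, scores_factored)
```

The key point the sketch makes concrete: absorbing $B^\top$ into $W_Q$ is lossless and free at inference time, because queries are computed fresh per step and never cached; only the rank-$r$ truncation of $W_K$ introduces approximation error, which the paper's brief QK fine-tuning then recovers.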
Problem

Research questions and friction points this paper is trying to address.

KV cache
attention mechanism
memory efficiency
transformer
dimensionality reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

factored keys
KV cache compression
low-dimensional attention
SVD decomposition
attention selection
Hengshuai Yao
Sapient Intelligence; Department of Computing Science, University of Alberta
Xing Chen
Sapient Intelligence
Ahmed Murtadha
Sapient Intelligence
Guan Wang
CEO, Sapient Intelligence
Artificial General Intelligence · Reinforcement Learning · Large Language Models