Value-Guided KV Compression for LLMs via Approximated CUR Decomposition

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
KV caching during large language model inference incurs a high memory footprint and latency. Existing cache-pruning methods rely solely on query-key attention scores to rank tokens, overlooking the direct impact of value vectors on the generated output. This work proposes a value-centric KV cache compression framework that introduces CUR matrix decomposition, novel in this context, to approximate leverage scores that jointly model the contributions of keys and values. Token retention is optimized via end-to-end minimization of attention reconstruction error, and the method is fully compatible with FlashAttention and grouped-query attention without architectural modifications. Evaluated on LLaMA and Mistral, the approach improves accuracy by up to 9.6% under aggressive compression budgets while reducing generation latency by up to 40%, significantly enhancing the speed–accuracy trade-off.

📝 Abstract
Key-value (KV) cache compression has emerged as a critical technique for reducing the memory and latency overhead of autoregressive language models during inference. Prior approaches predominantly rely on query-key attention scores to rank and evict cached tokens, assuming that attention intensity correlates with semantic importance. However, this heuristic overlooks the contribution of value vectors, which directly influence the attention output. In this paper, we propose CurDKV, a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition. Our approach approximates the dominant subspace of the attention output $softmax(QK^T)V$, ensuring that the retained tokens best preserve the model's predictive behavior. Theoretically, we show that attention score approximation does not guarantee output preservation, and demonstrate that CUR-based selection minimizes end-to-end attention reconstruction loss. Empirically, CurDKV achieves up to 9.6% higher accuracy than state-of-the-art methods like SnapKV and ChunkKV under aggressive compression budgets on LLaMA and Mistral, while maintaining compatibility with FlashAttention and Grouped Query Attention. In addition to improved accuracy, CurDKV reduces generation latency by up to 40% at high compression, offering a practical speed-accuracy tradeoff.
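The abstract's theoretical claim, that approximating attention scores does not guarantee output preservation, can be illustrated with a toy NumPy example. The numbers below are hypothetical (not from the paper), and the value-aware criterion shown is a simple weight-times-value-norm heuristic used only for illustration, not the CurDKV selection rule:

```python
import numpy as np

def attn_out(q, K, V):
    """Single-query softmax attention over cached keys K and values V."""
    w = np.exp(K @ q)
    w = w / w.sum()
    return w @ V

q = np.array([1.0, 0.0])
K = np.array([[2.0, 0.0],   # token 0: highest attention score
              [1.0, 0.0],   # token 1: medium score, small value vector
              [0.5, 0.0]])  # token 2: lowest score, but a large value vector
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [20.0, 20.0]])

full = attn_out(q, K, V)
scores = np.exp(K @ q) / np.exp(K @ q).sum()

# A score-only heuristic evicts token 2 (lowest attention weight) ...
keep_score = [0, 1]
err_score = np.linalg.norm(full - attn_out(q, K[keep_score], V[keep_score]))

# ... while a value-aware criterion (weight x value norm) evicts token 1.
contrib = scores * np.linalg.norm(V, axis=1)
keep_value = [i for i in range(3) if i != int(np.argmin(contrib))]
err_value = np.linalg.norm(full - attn_out(q, K[keep_value], V[keep_value]))

# Evicting by attention score alone incurs the larger output error here,
# because token 2's small weight multiplies a large value vector.
assert err_value < err_score
```

Two tokens can have identical attention weights yet contribute very differently to $softmax(QK^T)V$ once value magnitudes differ, which is the gap the value-centric selection targets.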
Problem

Research questions and friction points this paper is trying to address.

KV cache compression reduces memory and latency in LLMs
Prior methods overlook value vectors' contribution to attention output
Proposed method selects keys and values via CUR decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Value-guided KV compression via CUR decomposition
Selects tokens based on leverage scores
Theoretically shown to minimize end-to-end attention reconstruction loss
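The leverage-score selection idea can be sketched as follows. This is a minimal, assumption-laden illustration: it computes exact rank-k leverage scores of the value matrix via an SVD, whereas the paper approximates leverage scores that jointly model keys and values; the function names (`leverage_scores`, `select_tokens`) and the budget/rank settings are hypothetical:

```python
import numpy as np

def leverage_scores(M, k):
    """Rank-k leverage scores of the rows of M: squared row norms of the
    top-k left singular vectors. High-leverage rows span the dominant
    subspace, so keeping them best preserves a rank-k reconstruction."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return np.sum(U[:, :k] ** 2, axis=1)

def select_tokens(V, budget, k):
    """Keep the `budget` cached tokens whose value rows score highest."""
    scores = leverage_scores(V, k)
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)

rng = np.random.default_rng(0)
V = rng.normal(size=(128, 64))        # 128 cached tokens, head dim 64
kept = select_tokens(V, budget=32, k=8)
```

A useful sanity check on this notion of importance: the leverage scores of a matrix with rank at least k always sum to exactly k, so they behave like a probability mass (scaled by k) over the cached tokens.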