GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

📅 2026-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the substantial memory overhead of KV caches in long-context large language models. When span-based retention is combined with post-eviction merging, merges concentrate on a small set of span-boundary carrier tokens, which causes over-merging and information loss. To mitigate this, the authors propose GRKV, a training-free KV cache merging method built on ridge regression. GRKV minimizes the discrepancy between the compressed-cache and full-cache attention outputs, distributing information from evicted tokens across the retained tokens, and regularizes the updates to prevent over-smoothing. On the LongBench and RULER benchmarks, the authors report that GRKV is the only merging method that improves overall performance, and that it does so with minimal overhead.
📝 Abstract
Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.
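The regression idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the abstract does not give the objective, so the sketch assumes one plausible form. It keeps the retained keys fixed and solves a ridge problem for a correction to the retained values, so that attention over the compressed cache approximates attention over the full cache for a set of probe queries. The function names, the probe queries `Q`, the evenly spaced `keep` indices, and the penalty `lam` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Single-head scaled dot-product attention, no masking.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def ridge_merge_values(Q, K, V, keep, lam=1e-2):
    """Assumed objective (not taken from the paper):
        min_D ||O_full - A_s (V_s + D)||^2 + lam * ||D||^2
    where A_s holds the attention weights over the retained keys only.
    The penalty on D keeps the updated values close to the originals.
    """
    O_full = attention(Q, K, V)                       # target: full-cache output
    K_s, V_s = K[keep], V[keep]
    A_s = softmax(Q @ K_s.T / np.sqrt(Q.shape[-1]))   # (num_queries, num_kept)
    residual = O_full - A_s @ V_s                     # error of plain eviction
    gram = A_s.T @ A_s + lam * np.eye(len(keep))
    delta = np.linalg.solve(gram, A_s.T @ residual)   # closed-form ridge solution
    return K_s, V_s + delta

# Toy data: 64 probe queries, 128 cached tokens, head dimension 16.
rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 16))
K = rng.standard_normal((128, 16))
V = rng.standard_normal((128, 16))
keep = np.arange(0, 128, 4)                           # hypothetical retention: every 4th token

K_s, V_new = ridge_merge_values(Q, K, V, keep)
O_full = attention(Q, K, V)
err_evict = np.linalg.norm(O_full - attention(Q, K_s, V[keep]))
err_merge = np.linalg.norm(O_full - attention(Q, K_s, V_new))
print(f"eviction only: {err_evict:.4f}  ridge-updated: {err_merge:.4f}")
```

Under this assumed objective, the ridge-updated error on the probe queries cannot exceed the eviction-only error, because a zero correction is always a feasible solution. Larger `lam` keeps the values closer to their originals at the cost of a looser fit.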
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
span-based retention
over-merging
information loss
long-context LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression
training-free
global regression
long-context LLMs
attention approximation