🤖 AI Summary
In LLM inference, KV cache memory consumption grows rapidly with sequence length; existing compression methods impose fixed compression ratios, creating a "Procrustean bed" problem: rigid resource allocation and degraded performance. This paper proposes an adaptive KV cache compression framework that eliminates manual compression budget specification. It uses Monte Carlo sampling to simulate future queries and identifies critical key-value pairs via attention importance scoring, aggregating the keys each sampled query selects to decide what to retain. By decoupling compression from predefined ratios, the method achieves demand-driven, real-time cache pruning. Evaluated on GSM8K, RULER, and LongBench, it attains 2× memory reduction while preserving or exceeding baseline accuracy.
📝 Abstract
Large language model (LLM) inference relies heavily on KV caches to accelerate autoregressive decoding, but the resulting memory footprint grows rapidly with sequence length, posing significant efficiency challenges. Current KV-cache compression methods suffer from a Procrustean bed problem: they force diverse workloads into fixed compression ratios, leading to suboptimal resource allocation and inference performance. To address this, we present GVote, an adaptive KV-cache compression scheme that eliminates manual budget specification while achieving superior accuracy-efficiency trade-offs. GVote operates on the principle that the important keys are the aggregation of the keys required by future queries. The method predicts future attention demands by Monte Carlo-style sampling of potential queries and aggregating the keys they select, determining the cache budget without manual specification. Experimental evaluation demonstrates GVote's effectiveness across multiple benchmarks, including GSM8K, RULER, and LongBench. Compared to baselines, GVote achieves a 2$\times$ memory reduction while maintaining higher or comparable accuracy.
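To make the mechanism concrete, here is a minimal sketch of the idea in PyTorch. It is not the paper's implementation: the function name `gvote_select`, the parameters `n_samples` and `top_k`, and the Gaussian model fitted to recently observed queries are all illustrative assumptions; only the overall pattern (sample plausible future queries, let each vote for its most-attended keys, retain the aggregate) follows the abstract.

```python
import torch

def gvote_select(keys, values, recent_queries, n_samples=32, top_k=64):
    """Hypothetical sketch of GVote-style adaptive KV-cache selection.

    keys, values: (seq_len, d) cached key/value tensors for one head.
    recent_queries: (m, d) observed queries used to model future ones.
    Returns the retained keys/values and their indices.
    """
    d = keys.shape[-1]

    # Monte Carlo step: sample plausible future queries. Here we assume a
    # Gaussian fitted to recently observed queries; the paper's actual
    # sampling distribution may differ.
    mu = recent_queries.mean(dim=0)
    std = recent_queries.std(dim=0) + 1e-6
    sampled = mu + std * torch.randn(n_samples, d)

    # Attention-importance scoring: each sampled query votes for the keys
    # it attends to most strongly.
    scores = torch.softmax(sampled @ keys.T / d ** 0.5, dim=-1)
    top_idx = scores.topk(top_k, dim=-1).indices  # (n_samples, top_k)

    # Aggregation: retain the union of keys selected by any sampled query,
    # so the budget emerges from demand rather than a fixed ratio.
    keep = torch.unique(top_idx.flatten())
    return keys[keep], values[keep], keep
```

Note how the retained set's size is an output, not an input: distinct workloads naturally yield different effective compression ratios, which is the "adaptive budget" property the abstract emphasizes.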