TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Transformer inference, KV cache memory consumption scales linearly with sequence length, severely hindering long-context deployment. Existing quantization methods require separate handling of sparse outliers, which adds overhead and implementation complexity. This paper proposes a training-free, adaptive KV cache compression framework with two components: (1) per-layer bit-width allocation driven by inter-layer error-sensitivity analysis, and (2) mean-centering reparameterization of key and value tensors, which systematically removes outliers and obviates explicit outlier isolation. Evaluated across multiple models and context lengths, the method reduces KV cache memory to 27% of the 16-bit baseline while maintaining comparable accuracy. To the authors' knowledge, this is the first KV cache compression approach to achieve high-fidelity quantization without special-case outlier treatment, significantly improving memory efficiency and scalability for large-model inference.
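The mean-centering idea can be illustrated with a minimal sketch: subtract a per-channel mean from the KV tensor before symmetric integer quantization, so channel-wise offsets that would otherwise appear as outliers shrink into the quantization grid. This is an illustrative toy in numpy under assumed details (per-channel absmax scaling, signed 4-bit grid), not the paper's exact algorithm.

```python
import numpy as np

def quantize_mean_centered(x, n_bits=4):
    """Symmetric quantization of a KV tensor after per-channel mean removal.

    Centering shrinks the dynamic range so former outliers fit the grid,
    removing the need for a separate sparse-outlier path (sketch only).
    """
    mean = x.mean(axis=0, keepdims=True)          # per-channel mean
    centered = x - mean                            # reparameterize
    qmax = 2 ** (n_bits - 1) - 1                   # e.g. 7 for 4-bit signed
    scale = np.abs(centered).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard against div-by-zero
    q = np.round(centered / scale).astype(np.int8)
    return q, scale, mean

def dequantize(q, scale, mean):
    # Invert the reparameterization: rescale, then add the mean back.
    return q.astype(np.float32) * scale + mean

# Toy KV tensor with a large per-channel offset acting as an "outlier" source.
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 8)).astype(np.float32) + np.arange(8) * 10
q, scale, mean = quantize_mean_centered(x, n_bits=4)
err = np.abs(dequantize(q, scale, mean) - x).max()
```

Without centering, the channel offsets (up to 70 here) would dominate the absmax scale and crush the signal into a few quantization levels; after centering, only the zero-mean residual needs to fit the 4-bit grid.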

📝 Abstract
The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and mean centering that eliminates separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chains of thought.
Problem

Research questions and friction points this paper is trying to address.

KV cache memory demands scale poorly with sequence length
Existing KV cache quantization methods struggle with outlier handling
Need for scalable high-performance reasoning in language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive KV cache compression via quantization
Mean-centering eliminates outlier handling
Reduces KV cache memory footprint to 27% of the 16-bit baseline
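The adaptive-precision idea behind the first innovation can be sketched as a budgeted allocation: given a per-layer error-sensitivity score, promote the most sensitive layers to higher bit-widths until an average-bit budget is exhausted. The greedy policy and the `sensitivities` input (e.g. from a calibration pass) are assumptions for illustration, not the paper's exact procedure.

```python
def allocate_bits(sensitivities, avg_budget, choices=(2, 4, 8)):
    """Greedy per-layer bit-width allocation under an average-bit budget.

    Layers with higher quantization-error sensitivity get more bits.
    Sketch only: real sensitivity scores would come from calibration.
    """
    n = len(sensitivities)
    lo = min(choices)
    bits = [lo] * n                       # start every layer at the floor
    spare = avg_budget * n - lo * n       # budget remaining above the floor
    # Promote the most sensitive layers first, to the widest width that fits.
    for i in sorted(range(n), key=lambda i: -sensitivities[i]):
        for b in sorted(choices, reverse=True):
            cost = b - bits[i]
            if 0 < cost <= spare:
                bits[i] = b
                spare -= cost
                break
    return bits

# Four layers, average budget of 4 bits: the most sensitive layer gets 8 bits,
# the next gets 4, and the rest stay at the 2-bit floor.
plan = allocate_bits([0.1, 0.9, 0.5, 0.2], avg_budget=4)
```

Here `plan` is `[2, 8, 4, 2]`: the total (16 bits over 4 layers) meets the 4-bit average exactly while concentrating precision where quantization error hurts most.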