TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Transformer inference, KV cache memory consumption scales linearly with sequence length, severely hindering long-context deployment. Existing quantization methods require separate handling of sparse outliers, which adds overhead and implementation complexity. This paper proposes a training-free, adaptive KV cache compression framework with two components: (1) per-layer bit-width allocation driven by inter-layer error-sensitivity analysis, and (2) mean-centering reparameterization of key and value tensors, which systematically removes outliers and obviates explicit outlier isolation. Evaluated across multiple models and context lengths, the method reduces KV cache memory to 27% of the 16-bit baseline while maintaining comparable accuracy. To the authors' knowledge, this is the first KV cache compression approach to achieve high-fidelity quantization without special-case outlier treatment, significantly improving memory efficiency and scalability for large-model inference.
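The mean-centering idea can be illustrated with a minimal sketch: subtract a per-channel mean from the KV tensor before symmetric integer quantization, so channel-wise offsets that would otherwise appear as outliers shrink into the quantization grid. This is an illustrative toy in numpy under assumed details (per-channel absmax scaling, signed 4-bit grid), not the paper's exact algorithm.

```python
import numpy as np

def quantize_mean_centered(x, n_bits=4):
    """Symmetric quantization of a KV tensor after per-channel mean removal.

    Centering shrinks the dynamic range so former outliers fit the grid,
    removing the need for a separate sparse-outlier path (sketch only).
    """
    mean = x.mean(axis=0, keepdims=True)          # per-channel mean
    centered = x - mean                            # reparameterize
    qmax = 2 ** (n_bits - 1) - 1                   # e.g. 7 for 4-bit signed
    scale = np.abs(centered).max(axis=0, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)       # guard against div-by-zero
    q = np.round(centered / scale).astype(np.int8)
    return q, scale, mean

def dequantize(q, scale, mean):
    # Invert the reparameterization: rescale, then add the mean back.
    return q.astype(np.float32) * scale + mean

# Toy KV tensor with a large per-channel offset acting as an "outlier" source.
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 8)).astype(np.float32) + np.arange(8) * 10
q, scale, mean = quantize_mean_centered(x, n_bits=4)
err = np.abs(dequantize(q, scale, mean) - x).max()
```

Without centering, the channel offsets (up to 70 here) would dominate the absmax scale and crush the signal into a few quantization levels; after centering, only the zero-mean residual needs to fit the 4-bit grid.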

📝 Abstract
The key-value (KV) cache in transformer models is a critical component for efficient decoding or inference, yet its memory demands scale poorly with sequence length, posing a major challenge for scalable deployment of large language models. Among several approaches to KV cache compression, quantization of key and value activations has been widely explored. Most KV cache quantization methods still need to manage sparse and noncontiguous outliers separately. To address this, we introduce TaDA, a training-free recipe for KV cache compression with quantization precision that adapts to error sensitivity across layers and mean centering that eliminates separate outlier handling. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Moreover, our approach does not need to separately manage outlier elements -- a persistent hurdle in most traditional quantization methods. Experiments on standard benchmarks demonstrate that our technique reduces KV cache memory footprint to 27% of the original 16-bit baseline while achieving comparable accuracy. Our method paves the way for scalable and high-performance reasoning in language models by potentially enabling inference for longer context length models, reasoning models, and longer chains of thought.
Problem

Research questions and friction points this paper is trying to address.

KV cache memory demands scale poorly with sequence length
Existing KV cache quantization methods struggle with outlier handling
Need for scalable high-performance reasoning in language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive KV cache compression via quantization
Mean-centering eliminates outlier handling
Reduces KV cache memory footprint to 27% of the 16-bit baseline
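The adaptive-precision idea behind the first innovation can be sketched as a budgeted allocation: given a per-layer error-sensitivity score, promote the most sensitive layers to higher bit-widths until an average-bit budget is exhausted. The greedy policy and the `sensitivities` input (e.g. from a calibration pass) are assumptions for illustration, not the paper's exact procedure.

```python
def allocate_bits(sensitivities, avg_budget, choices=(2, 4, 8)):
    """Greedy per-layer bit-width allocation under an average-bit budget.

    Layers with higher quantization-error sensitivity get more bits.
    Sketch only: real sensitivity scores would come from calibration.
    """
    n = len(sensitivities)
    lo = min(choices)
    bits = [lo] * n                       # start every layer at the floor
    spare = avg_budget * n - lo * n       # budget remaining above the floor
    # Promote the most sensitive layers first, to the widest width that fits.
    for i in sorted(range(n), key=lambda i: -sensitivities[i]):
        for b in sorted(choices, reverse=True):
            cost = b - bits[i]
            if 0 < cost <= spare:
                bits[i] = b
                spare -= cost
                break
    return bits

# Four layers, average budget of 4 bits: the most sensitive layer gets 8 bits,
# the next gets 4, and the rest stay at the 2-bit floor.
plan = allocate_bits([0.1, 0.9, 0.5, 0.2], avg_budget=4)
```

Here `plan` is `[2, 8, 4, 2]`: the total (16 bits over 4 layers) meets the 4-bit average exactly while concentrating precision where quantization error hurts most.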