🤖 AI Summary
To address the excessive memory footprint of Key-Value (KV) caches in long-context inference for large language models (LLMs), this paper proposes an adaptive quantization method that jointly models cross-layer KV dependencies for the first time. It introduces lightweight adapters to predict derivable KV information, applying optimal quantization only to the unpredictable residual components—enabling near-lossless compression. The method integrates compact adapter-driven adaptive quantization, explicit cross-layer dependency modeling, and one-shot calibration—requiring no fine-tuning. Evaluated on the Llama 3.2 series, it achieves 2–2.5 bits per value while maintaining <1% accuracy degradation on LongBench and negligible perplexity increase. For the 70B model, single-GPU calibration completes in just 1–6 hours—substantially outperforming existing KV compression approaches in both efficiency and fidelity.
📝 Abstract
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning, or merging, but these techniques often compromise quality at higher compression rates. In this work, we aim to improve Key-Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization scheme for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to "optimally" compress the information that cannot be predicted. AQUA-KV significantly improves compression rates while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under 1% relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
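The core idea of predicting KV states from a neighboring layer and quantizing only the residual can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the linear least-squares "adapter," the synthetic correlated data, and the uniform quantizer are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_quantize(x, bits):
    # Per-tensor uniform quantization to `bits` bits (toy stand-in for an
    # "optimal" quantizer).
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo

# Synthetic calibration data: keys at layer l+1 strongly correlated with layer l.
d = 64
k_prev = rng.normal(size=(1000, d))
W_true = 0.1 * rng.normal(size=(d, d)) + np.eye(d)
k_next = k_prev @ W_true + 0.05 * rng.normal(size=(1000, d))

# One-shot calibration: fit a lightweight linear adapter by least squares
# (hypothetical stand-in for the paper's compact adapters).
W, *_ = np.linalg.lstsq(k_prev, k_next, rcond=None)

# Compress: keep the prediction, quantize only the unpredictable residual.
pred = k_prev @ W
k_hat = pred + uniform_quantize(k_next - pred, bits=2)

# Baseline: quantize the raw cache directly at the same bit-width.
k_base = uniform_quantize(k_next, bits=2)

err_residual = np.abs(k_hat - k_next).mean()
err_direct = np.abs(k_base - k_next).mean()
```

Because the residual has a much smaller dynamic range than the raw cache, the same 2-bit budget yields far lower reconstruction error, which is the intuition behind the near-lossless 2-2.5 bit results reported above.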