Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

📅 2024-07-16
🏛️ arXiv.org
📈 Citations: 5
Influential: 1
🤖 AI Summary
To address the efficiency bottleneck caused by KV cache redundancy in long-sequence inference of large language models (LLMs), this paper proposes the first head-wise adaptive budget allocation mechanism for KV cache compression. The authors derive a theoretical upper bound on the attention-output reconstruction error induced by KV token eviction and use it to design a head-granularity adaptive budget allocation strategy that integrates plug-and-play with mainstream compression methods (e.g., StreamingLLM, KVQuant). By modeling attention-output distortion and introducing a lightweight scheduling framework, the method supports joint optimization in both question-aware and question-agnostic scenarios. Evaluated on Ruler (13 datasets) and LongBench (16 datasets), it achieves an average 12.7% reduction in perplexity (PPL) and a 2.1× throughput improvement over state-of-the-art approaches.

📝 Abstract
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Based on this, we propose Ada-KV, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods.
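The head-wise adaptive allocation described above can be sketched in a few lines — a minimal illustration, not the paper's implementation (the function name, NumPy setup, and toy scores are assumptions): instead of giving every head the same budget, aggregated attention scores from all heads compete in a single global top-k, and each head keeps as many cache entries as it wins.

```python
import numpy as np

def adaptive_head_budgets(head_scores, total_budget):
    """Allocate a shared KV cache budget across heads (illustrative sketch).

    head_scores: list of 1-D arrays, one per head, holding each head's
                 aggregated attention score per cached token.
    total_budget: total number of KV entries to retain across all heads.
    Returns a per-head budget list that sums to total_budget.
    """
    flat = np.concatenate(head_scores)        # pool all heads' scores together
    k = min(total_budget, flat.size)
    selected = np.argsort(flat)[-k:]          # global top-k positions

    # Count how many of the selected positions fall inside each head's slice.
    budgets, offset = [], 0
    for scores in head_scores:
        in_head = np.sum((selected >= offset) & (selected < offset + scores.size))
        budgets.append(int(in_head))
        offset += scores.size
    return budgets

# A head whose scores are concentrated on few tokens wins less budget;
# a head with several strong tokens wins more.
print(adaptive_head_budgets([np.array([0.9, 0.1]), np.array([0.5, 0.4])], 3))
```

A uniform baseline would give each head the same share of the budget; here the split follows where the large attention scores actually are, which is the intuition behind the paper's adaptive strategy.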
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Long-Sequence Processing
Efficiency Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ada-KV
Cache-Size Adjustment
Performance Optimization
Yuan Feng
School of Computer Science, University of Science and Technology of China (USTC), China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China
Junlin Lv
USTC
Yukun Cao
School of Computer Science, University of Science and Technology of China (USTC), China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China
Xike Xie
School of Biomedical Engineering, USTC, China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China
S. Kevin Zhou
School of Biomedical Engineering, USTC, China; Data Darkness Lab, MIRACLE Center, Suzhou Institute for Advanced Research, USTC, China