🤖 AI Summary
In long-context LLM inference, KV cache memory consumption grows linearly with sequence length, and existing strategies that compress the cache uniformly across attention heads yield suboptimal accuracy. Method: This paper proposes a head-level adaptive KV cache memory budget allocation method. A lightweight, one-time offline profiling step estimates each attention head's importance to output quality, enabling fine-grained, head-wise cache retention and compression. Contribution/Results: Departing from conventional uniform pruning, the approach supports importance-aware, differentiated memory budget allocation rather than a single shared budget. Evaluated on LLaMA-3-8B and Qwen2.5-7B, it achieves up to 70% KV cache memory reduction while preserving baseline accuracy; at higher compression ratios, task accuracy improves by up to 10× over baseline methods.
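To make the budget-allocation idea concrete, here is a minimal sketch of how a total KV-cache token budget could be split across heads in proportion to profiled importance scores. The function name, the proportional-split rule, and the per-head floor are illustrative assumptions, not the paper's exact allocation policy.

```python
import numpy as np

def allocate_head_budgets(importance, total_budget, min_tokens=16):
    """Split a total KV-cache token budget across attention heads
    in proportion to their profiled importance scores.

    `importance` is a 1-D array of per-head scores from a one-time
    offline profiling pass; `total_budget` is the total number of
    cached tokens the model may keep across all heads.
    (Illustrative sketch; not the paper's exact allocation rule.)
    """
    importance = np.asarray(importance, dtype=np.float64)
    num_heads = importance.size

    # Guarantee every head keeps at least a small cache, then hand out
    # the remaining tokens proportionally to importance.
    remaining = total_budget - min_tokens * num_heads
    weights = importance / importance.sum()
    budgets = min_tokens + np.floor(weights * remaining).astype(int)
    return budgets

# Example: 8 heads sharing a 4096-token total budget.
scores = [0.9, 0.2, 0.5, 0.1, 0.7, 0.3, 0.8, 0.4]
print(allocate_head_budgets(scores, total_budget=4096))
```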
📝 Abstract
In Large Language Model (LLM) inference, Key-Value caches (KV-caches) are essential for reducing time complexity. However, their GPU memory footprint grows linearly with context length. While recent work explores KV-cache eviction and compression policies to reduce memory usage, these methods typically apply uniform KV-cache budgets across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method that allocates optimal memory to individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical to LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on the LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while maintaining baseline performance and delivering up to an order-of-magnitude accuracy improvement at higher compression levels.
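As a companion to the allocation sketch above, the snippet below shows one way a one-time profiling pass might score head importance: averaging, over a small calibration set, how much attention mass each head places on tokens outside a recent window. The function name, the attention-mass proxy, and the `recent_window` parameter are assumptions for illustration only, not BaKlaVa's actual profiling criterion.

```python
import torch

@torch.no_grad()
def profile_head_importance(model, tokenizer, prompts, recent_window=64):
    """Hypothetical importance proxy: for each head, measure the attention
    mass placed on tokens older than the most recent `recent_window`
    positions, averaged over a small calibration set. Heads that look far
    back are treated as more important and would receive larger budgets.
    """
    totals = None
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model(**inputs, output_attentions=True)
        # out.attentions: tuple of (batch, heads, seq, seq), one per layer
        scores = []
        for attn in out.attentions:
            seq = attn.shape[-1]
            # Attention paid to keys older than the recent window.
            distant = attn[..., : max(seq - recent_window, 0)].sum(dim=-1)
            scores.append(distant.mean(dim=(0, 2)))   # -> (heads,)
        per_head = torch.stack(scores)                 # (layers, heads)
        totals = per_head if totals is None else totals + per_head
    return totals / len(prompts)
```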