BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-context LLM inference, KV-cache memory consumption scales linearly with sequence length, and existing compression strategies that apply a uniform budget across all attention heads achieve limited performance. Method: the paper proposes head-level adaptive allocation of the KV-cache memory budget. A lightweight, one-time offline profiling step estimates each attention head's importance to output quality, enabling fine-grained, head-wise cache retention and compression. Contribution/Results: departing from conventional uniform pruning, the approach supports importance-aware, differentiated memory-budget allocation. Evaluated on LLaMA-3-8B and Qwen2.5-7B, it achieves up to 70% KV-cache memory reduction with no accuracy degradation; at higher compression ratios, task accuracy improves by up to an order of magnitude over baseline methods.
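The head-wise allocation idea in the summary can be sketched as a simple proportional split of a total cache budget across heads. This is an illustrative assumption, not the paper's actual algorithm: the function name, the proportional rule, and the per-head floor are all hypothetical.

```python
# Hypothetical sketch: split a total KV-cache token budget across
# attention heads in proportion to offline-profiled importance scores.
# The allocation rule and all names are illustrative assumptions.

def allocate_head_budgets(importance, total_budget, min_budget=16):
    """Distribute `total_budget` cached tokens across heads by importance.

    importance   : non-negative per-head scores (from a one-time profile)
    total_budget : total tokens to retain across all heads
    min_budget   : floor so no head is starved entirely
    """
    n = len(importance)
    floor_total = min_budget * n
    assert total_budget >= floor_total, "budget too small for per-head floors"
    remaining = total_budget - floor_total
    score_sum = sum(importance) or 1.0
    budgets = [min_budget + int(remaining * s / score_sum) for s in importance]
    # Hand tokens lost to integer truncation to the most important heads.
    leftover = total_budget - sum(budgets)
    for i in sorted(range(n), key=lambda i: -importance[i])[:leftover]:
        budgets[i] += 1
    return budgets

print(allocate_head_budgets([0.1, 0.5, 0.2, 0.2], total_budget=1000))
```

Important heads keep long histories while unimportant ones fall back to the floor, which is the intuition behind non-uniform budgets outperforming uniform ones at the same total memory.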

📝 Abstract
In Large Language Model (LLM) inference, Key-Value (KV) caches are essential for reducing time complexity. However, they result in a linear increase in GPU memory as the context length grows. While recent work explores KV-cache eviction and compression policies to reduce memory usage, it typically assumes uniform KV-caches across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method that allocates optimal memory to individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical for LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on the LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while maintaining baseline performance and delivering up to an order-of-magnitude accuracy improvement at higher compression levels.
Problem

Research questions and friction points this paper is trying to address.

KV-cache memory grows linearly with context length
Uniform per-head cache budgets waste memory on unimportant heads
Existing eviction/compression policies degrade accuracy at high compression
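The linear-growth friction point above is easy to make concrete with back-of-envelope arithmetic. The sketch below assumes a LLaMA-3-8B-like grouped-query-attention configuration (32 layers, 8 KV heads, head dimension 128, fp16); the figures are illustrative, not taken from the paper.

```python
# Back-of-envelope KV-cache footprint, assuming a LLaMA-3-8B-like
# GQA config (32 layers, 8 KV heads, head_dim 128, fp16). Illustrative only.

def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2: one K tensor and one V tensor are cached per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
# 8k context -> 1.0 GiB; 32k -> 4.0 GiB; 128k -> 16.0 GiB
```

Because the cost is strictly linear in `seq_len`, quadrupling the context quadruples the cache, which is why a 70% budget reduction translates directly into longer feasible contexts on the same GPU.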
Innovation

Methods, ideas, or system contributions that make the work stand out.

Allocates a separate, optimized memory budget to each KV-cache.
Estimates per-head importance via a one-time offline profiling pass.
Maintains baseline accuracy at up to 70% compression.
👥 Authors
A. B. Gulhan (The Pennsylvania State University, State College, USA)
Krishna Teja Chitty-Venkata (ML Research Engineer @ Red Hat)
M. Emani (Argonne National Lab, Lemont, USA)
Mahmut Kandemir (Pennsylvania State University)
Venkat Vishwanath (Argonne National Lab, Lemont, USA)