🤖 AI Summary
In long-context LLM inference, KV cache memory consumption grows linearly with sequence length, and existing strategies that compress the cache uniformly across attention heads yield suboptimal accuracy. Method: This paper proposes a head-level adaptive KV cache memory budget allocation method. A lightweight, one-time offline profiling step estimates each attention head's importance to output quality, enabling fine-grained, head-wise cache retention and compression. Contribution/Results: Departing from conventional uniform pruning, the approach supports importance-aware, differentiated memory budget allocation rather than a single shared budget. Evaluated on LLaMA-3-8B and Qwen2.5-7B, it achieves up to 70% KV cache memory reduction while preserving baseline accuracy; at higher compression ratios, task accuracy improves by up to 10× over baseline methods.
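To make the budget-allocation idea concrete, here is a minimal sketch of how a total KV-cache token budget could be split across heads in proportion to profiled importance scores. The function name, the proportional-split rule, and the per-head floor are illustrative assumptions, not the paper's exact allocation policy.

```python
import numpy as np

def allocate_head_budgets(importance, total_budget, min_tokens=16):
    """Split a total KV-cache token budget across attention heads
    in proportion to their profiled importance scores.

    `importance` is a 1-D array of per-head scores from a one-time
    offline profiling pass; `total_budget` is the total number of
    cached tokens the model may keep across all heads.
    (Illustrative sketch; not the paper's exact allocation rule.)
    """
    importance = np.asarray(importance, dtype=np.float64)
    num_heads = importance.size

    # Guarantee every head keeps at least a small cache, then hand out
    # the remaining tokens proportionally to importance.
    remaining = total_budget - min_tokens * num_heads
    weights = importance / importance.sum()
    budgets = min_tokens + np.floor(weights * remaining).astype(int)
    return budgets

# Example: 8 heads sharing a 4096-token total budget.
scores = [0.9, 0.2, 0.5, 0.1, 0.7, 0.3, 0.8, 0.4]
print(allocate_head_budgets(scores, total_budget=4096))
```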
📝 Abstract
In Large Language Model (LLM) inference, Key-Value caches (KV-caches) are essential for reducing time complexity. However, their GPU memory footprint grows linearly with context length. While recent work explores KV-cache eviction and compression policies to reduce memory usage, these methods typically apply uniform KV-cache budgets across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method that allocates optimal memory to individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical to LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on the LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while maintaining baseline performance and delivering up to an order-of-magnitude accuracy improvement at higher compression levels.
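As a companion to the allocation sketch above, the snippet below shows one way a one-time profiling pass might score head importance: averaging, over a small calibration set, how much attention mass each head places on tokens outside a recent window. The function name, the attention-mass proxy, and the `recent_window` parameter are assumptions for illustration only, not BaKlaVa's actual profiling criterion.

```python
import torch

@torch.no_grad()
def profile_head_importance(model, tokenizer, prompts, recent_window=64):
    """Hypothetical importance proxy: for each head, measure the attention
    mass placed on tokens older than the most recent `recent_window`
    positions, averaged over a small calibration set. Heads that look far
    back are treated as more important and would receive larger budgets.
    """
    totals = None
    for text in prompts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model(**inputs, output_attentions=True)
        # out.attentions: tuple of (batch, heads, seq, seq), one per layer
        scores = []
        for attn in out.attentions:
            seq = attn.shape[-1]
            # Attention paid to keys older than the recent window.
            distant = attn[..., : max(seq - recent_window, 0)].sum(dim=-1)
            scores.append(distant.mean(dim=(0, 2)))   # -> (heads,)
        per_head = torch.stack(scores)                 # (layers, heads)
        totals = per_head if totals is None else totals + per_head
    return totals / len(prompts)
```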