KV Cache Compression for Inference Efficiency in LLMs: A Review

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the exponential growth of Key-Value (KV) cache memory consumption with context length in large language model (LLM) inference—causing severe memory bottlenecks—this work systematically surveys three mainstream optimization paradigms: selective caching, low-bit quantization, and attention compression, exposing their limitations in compute-storage trade-offs, task generalizability, and hardware compatibility. We propose a hybrid optimization framework, a dynamic adaptive caching strategy, and a hardware-software co-design paradigm to overcome the interoperability constraints inherent in single-method approaches. Experiments demonstrate that our method achieves an average 62% reduction in KV cache memory footprint while sustaining less than 1% accuracy degradation, and improves end-to-end inference throughput by 3.1×. The approach provides a scalable, theoretically grounded pathway for efficient deployment of long-context LLMs.
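As a quick illustration of the low-bit quantization paradigm surveyed here, the sketch below quantizes a cached key block to 8-bit integers with a per-token scale and zero-point. This is a minimal NumPy example under assumed shapes, not the paper's implementation; the function names quantize_kv and dequantize_kv are illustrative.

```python
import numpy as np

def quantize_kv(tensor: np.ndarray, bits: int = 8):
    """Per-token asymmetric quantization of a KV tensor.

    tensor: float array of shape (num_tokens, head_dim).
    Returns integer codes plus the per-token scale and zero-point
    needed to dequantize.
    """
    qmax = 2 ** bits - 1
    t_min = tensor.min(axis=-1, keepdims=True)
    t_max = tensor.max(axis=-1, keepdims=True)
    scale = (t_max - t_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    zero_point = t_min
    codes = np.clip(np.round((tensor - zero_point) / scale), 0, qmax).astype(np.uint8)
    return codes, scale, zero_point

def dequantize_kv(codes, scale, zero_point):
    """Recover an approximate float KV tensor from its quantized form."""
    return codes.astype(np.float32) * scale + zero_point

# Example: quantize a cached key block for one attention head
keys = np.random.randn(1024, 128).astype(np.float32)  # (tokens, head_dim)
codes, scale, zp = quantize_kv(keys, bits=8)
recovered = dequantize_kv(codes, scale, zp)
print("max abs error:", np.abs(keys - recovered).max())
```

Storing the uint8 codes plus two small per-token floats cuts the cache footprint by roughly 4x relative to float32, at the cost of the reconstruction error printed above.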

📝 Abstract
With the rapid advancement of large language models (LLMs), the context length for inference has been continuously increasing, leading to an exponential growth in the demand for Key-Value (KV) caching. This has resulted in a significant memory bottleneck, limiting the inference efficiency and scalability of the models. Therefore, optimizing the KV cache during inference is crucial for enhancing performance and efficiency. This review systematically examines current KV cache optimization techniques, including compression strategies such as selective token caching, quantization, and attention compression. We evaluate the effectiveness, trade-offs, and application scenarios of these methods, providing a comprehensive analysis of their impact on memory usage and inference speed. We focus on identifying the limitations and challenges of existing methods, such as compatibility issues with different models and tasks. Additionally, this review highlights future research directions, including hybrid optimization techniques, adaptive dynamic strategies, and software-hardware co-design. These approaches aim to improve inference efficiency and promote the practical application of large language models.
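The selective token caching mentioned in the abstract can be pictured as keeping only a subset of past tokens in the cache. Below is a minimal sketch assuming a policy of retaining a few initial "sink" tokens plus a sliding window of recent tokens; the class name SelectiveKVCache and its parameters are hypothetical, not taken from the paper.

```python
from collections import deque
import numpy as np

class SelectiveKVCache:
    """Toy selective KV cache: always keep the first `num_sink` tokens
    plus a sliding window of the most recent tokens, evicting the rest."""

    def __init__(self, num_sink: int = 4, window: int = 256):
        self.num_sink = num_sink
        self.sink_kv = []                      # (key, value) pairs kept permanently
        self.recent_kv = deque(maxlen=window)  # rolling window of recent pairs

    def append(self, key: np.ndarray, value: np.ndarray):
        if len(self.sink_kv) < self.num_sink:
            self.sink_kv.append((key, value))
        else:
            self.recent_kv.append((key, value))  # deque drops the oldest pair automatically

    def materialize(self):
        """Return the retained keys and values as stacked arrays for attention."""
        pairs = self.sink_kv + list(self.recent_kv)
        keys = np.stack([k for k, _ in pairs])
        values = np.stack([v for _, v in pairs])
        return keys, values

# Example: stream 10,000 tokens through the cache; memory stays bounded
cache = SelectiveKVCache(num_sink=4, window=256)
for _ in range(10_000):
    cache.append(np.random.randn(128), np.random.randn(128))
keys, values = cache.materialize()
print(keys.shape)  # (260, 128): 4 sink tokens + 256 recent tokens
```

The point of the sketch is that cache size stays constant regardless of sequence length, which is the memory-versus-accuracy trade-off such eviction strategies negotiate.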
Problem

Research questions and friction points this paper is trying to address.

Reduce KV cache memory bottleneck in LLMs
Evaluate KV cache compression techniques' trade-offs
Address compatibility issues in optimization methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression for memory efficiency
Selective token and quantization strategies
Hybrid optimization and hardware co-design
Yanyu Liu
Shandong University of Science and Technology, Qingdao 266590, China
Jingying Fu
Shandong University of Science and Technology, Qingdao 266590, China
Sixiang Liu
Shandong University of Science and Technology, Qingdao 266590, China
Yitian Zou
Shandong University of Science and Technology, Qingdao 266590, China
You Fu
Shandong University of Science and Technology, Qingdao 266590, China
Jiehan Zhou
Shandong University of Science and Technology
Industrial Large Models, Digital Twins, Industry 5.0, Internet of Things, Cloud Computing
Shouhua Zhang
University of Oulu, Oulu 90014, Finland