KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

📅 2025-06-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing KV cache eviction strategies for LLM inference services—particularly generic policies like LRU—suffer from poor adaptability to workload characteristics, leading to suboptimal performance. Method: This paper presents the first empirical analysis of KV caching behavior using real-world cloud service traces, revealing strong skewness, intra-class predictability, and low capacity sensitivity in KV reuse across single- and multi-turn requests. Based on these insights, we propose a workload-aware dynamic eviction policy that jointly models token热度 (access frequency) and时效 (temporal recency), validated via analytical cache modeling, offline trace replay, and online A/B testing. Contribution/Results: On production traces, our approach achieves up to a 23% higher cache hit rate than LRU; under memory-constrained conditions, it reduces end-to-end latency by 18% and increases throughput by over 15%, significantly enhancing deployment efficiency in practical LLM serving systems.

Technology Category

Application Category

📝 Abstract
Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV$ caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of the KV$ workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV$ reuses are skewed across requests, where reuses between single-turn requests are equally important as multi-turn requests; the reuse time and probability are diverse considering all requests, but for a specific request category, the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on the characterization, we further propose a workload-aware cache eviction policy that improves the serving performance under real-world traces, especially with limited cache capacity.
Problem

Research questions and friction points this paper is trying to address.

Characterizing KV cache workload patterns in large-scale LLM services
Optimizing cache eviction policies for diverse request categories
Improving LLM serving performance with limited cache capacity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Characterizes KV cache workload patterns
Proposes workload-aware cache eviction policy
Optimizes LLM serving performance
🔎 Similar Papers
No similar papers found.
J
Jiahao Wang
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Jinbo Han
Jinbo Han
IPADS, Shanghai Jiao Tong University
AI-Infra
Xingda Wei
Xingda Wei
Shanghai Jiao Tong University
System for AIDistributed systemOperating system
Sijie Shen
Sijie Shen
Peking University
Program AnalysisProgram GenerationDeep Learning
D
Dingyan Zhang
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Chenguang Fang
Chenguang Fang
Alibaba Group
LLM SystemData Management
R
Rong Chen
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University
Wenyuan Yu
Wenyuan Yu
Alibaba Group
Graph computationdata managementdistributed systems and parallel computation
H
Haibo Chen
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University