KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

📅 2025-06-03

📈 Citations: 0

✨ Influential: 0

career value

215K/year

🤖 AI Summary

Existing KV cache eviction strategies for LLM inference services—particularly generic policies like LRU—suffer from poor adaptability to workload characteristics, leading to suboptimal performance. Method: This paper presents the first empirical analysis of KV caching behavior using real-world cloud service traces, revealing strong skewness, intra-class predictability, and low capacity sensitivity in KV reuse across single- and multi-turn requests. Based on these insights, we propose a workload-aware dynamic eviction policy that jointly models token热度 (access frequency) and时效 (temporal recency), validated via analytical cache modeling, offline trace replay, and online A/B testing. Contribution/Results: On production traces, our approach achieves up to a 23% higher cache hit rate than LRU; under memory-constrained conditions, it reduces end-to-end latency by 18% and increases throughput by over 15%, significantly enhancing deployment efficiency in practical LLM serving systems.

Technology Category

Application Category

📝 Abstract

Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV$ caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of the KV$ workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV$ reuses are skewed across requests, where reuses between single-turn requests are equally important as multi-turn requests; the reuse time and probability are diverse considering all requests, but for a specific request category, the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on the characterization, we further propose a workload-aware cache eviction policy that improves the serving performance under real-world traces, especially with limited cache capacity.

Problem

Research questions and friction points this paper is trying to address.

Characterizing KV cache workload patterns in large-scale LLM services

Optimizing cache eviction policies for diverse request categories

Improving LLM serving performance with limited cache capacity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Characterizes KV cache workload patterns

Proposes workload-aware cache eviction policy

Optimizes LLM serving performance

🔎 Similar Papers

No similar papers found.