🤖 AI Summary
Existing KV cache eviction strategies for LLM inference services—particularly generic policies like LRU—suffer from poor adaptability to workload characteristics, leading to suboptimal performance. Method: This paper presents the first empirical analysis of KV caching behavior using real-world cloud service traces, revealing strong skewness, intra-class predictability, and low capacity sensitivity in KV reuse across single- and multi-turn requests. Based on these insights, we propose a workload-aware dynamic eviction policy that jointly models token热度 (access frequency) and时效 (temporal recency), validated via analytical cache modeling, offline trace replay, and online A/B testing. Contribution/Results: On production traces, our approach achieves up to a 23% higher cache hit rate than LRU; under memory-constrained conditions, it reduces end-to-end latency by 18% and increases throughput by over 15%, significantly enhancing deployment efficiency in practical LLM serving systems.
📝 Abstract
Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV$ caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of the KV$ workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV$ reuses are skewed across requests, where reuses between single-turn requests are equally important as multi-turn requests; the reuse time and probability are diverse considering all requests, but for a specific request category, the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on the characterization, we further propose a workload-aware cache eviction policy that improves the serving performance under real-world traces, especially with limited cache capacity.