Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Key-value cache (KVC) management bottlenecks hinder large language model (LLM) inference under extended-context scenarios. This work identifies, for the first time, the unique metadata management requirements imposed by KVC prefill—requirements unmet by mainstream KV stores (e.g., Redis) and RDMA-accelerated systems (e.g., CHIME, Sherman) due to fundamental limitations in latency, throughput, and cache hit rate. Method: We propose a novel distributed KVC paradigm tailored for LLM inference, built upon a real-trace-driven, prefix-reuse-aware metadata optimization framework. Contribution/Results: Our design significantly improves cache hit rate and throughput while reducing end-to-end inference latency. It delivers the first dedicated, efficient, and scalable KVC management solution—and an empirical benchmark—for high-reuse workloads such as retrieval-augmented generation (RAG) and LLM-based agents.

📝 Abstract
The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads like Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundancy and improving speed. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores like Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of a tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.
Problem

Research questions and friction points this paper is trying to address.

Efficient KVC management for LLM inference optimization
Cache reusability in RAG and agent workloads
Lack of tailored storage for KVC prefilling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes KVC access patterns using public traces
Evaluates Redis and RDMA-based systems for KVC
Proposes efficient distributed caching for LLM workloads
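To make the prefix-reuse idea concrete, below is a minimal sketch of how a prefix-reuse-aware metadata index for a KVC might work: token prefixes are chunked into fixed-size blocks, each block-aligned prefix is hashed, and a lookup walks the hashes to find the longest cached prefix that can be reused during prefill. This is an illustrative toy, not the paper's actual design; the class name `PrefixKVCIndex`, the block size, and the hashing scheme are all assumptions for demonstration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative choice)


class PrefixKVCIndex:
    """Toy metadata index mapping hashed token prefixes to KVC block IDs."""

    def __init__(self):
        # prefix hash -> ID of the cached KV block covering that prefix's tail
        self.index = {}

    @staticmethod
    def _hash(tokens):
        # Hash the token-ID prefix; a real system would use a cheaper rolling hash.
        return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

    def insert(self, tokens, block_ids):
        """Register every block-aligned prefix of `tokens` under its hash."""
        for i in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self.index[self._hash(tokens[:i])] = block_ids[i // BLOCK_SIZE - 1]

    def longest_prefix_hit(self, tokens):
        """Return (matched token count, cached block IDs) for the longest reusable prefix."""
        hits = []
        for i in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            h = self._hash(tokens[:i])
            if h not in self.index:
                break  # no longer prefix is cached; stop at the first miss
            hits.append(self.index[h])
        return len(hits) * BLOCK_SIZE, hits
```

A request whose first tokens match a previously cached prompt (as is common in RAG and agent workloads, where system prompts and retrieved contexts repeat) can then skip prefill for the matched blocks and recompute only the suffix. The efficiency of exactly this metadata lookup path, under real trace-driven access patterns, is what the paper evaluates against Redis and RDMA-based stores.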
Yue Zhu
IBM Research
Performance Optimization · I/O · Storage · Cloud

Hao Yu
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Chen Wang
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Zhuoran Liu
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Eun Kyung Lee
IBM T. J. Watson Research Center
Autonomic Datacenter Management