CALVO: Improve Serving Efficiency for LLM Inferences with Intense Network Demands

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the growing performance bottleneck posed by remote KVCache loading in long-context large language model inference, a factor that existing systems overlook in their scheduling decisions. It is the first to model KVCache network loading explicitly as an independent phase, optimizing it by decoupling it from GPU computation, introducing asynchronous pipelined scheduling, and integrating distributed prefix caching with a service-cost-aware request scheduling policy. This design enables efficient coordination between computation and communication. Evaluated on a real-world testbed, the proposed method substantially improves system efficiency, achieving up to a 61.67% increase in SLO compliance rate.
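The decoupling described above can be sketched as a two-stage pipeline: one thread fetches KVCache blocks over the network while another runs GPU computation on requests whose cache is already loaded, so transfer and compute overlap across requests. This is a minimal illustrative sketch, not CALVO's implementation; all names (`load_kv_blocks`, `run_prefill`, the request fields) are assumptions.

```python
import queue
import threading
import time

def load_kv_blocks(request):
    # Stand-in for fetching reusable KVCache blocks from a remote
    # prefix-cache server over the network.
    time.sleep(request["load_s"])
    return {**request, "kv_ready": True}

def run_prefill(request):
    # Stand-in for GPU prefill on the loaded KVCache.
    time.sleep(request["compute_s"])
    return request["id"]

def pipeline(requests):
    # Loading and compute progress asynchronously, connected by a queue:
    # while request i is being computed, request i+1 is being loaded.
    ready = queue.Queue()
    done = []

    def loader():
        for r in requests:
            ready.put(load_kv_blocks(r))
        ready.put(None)  # sentinel: no more requests

    t = threading.Thread(target=loader)
    t.start()
    while (r := ready.get()) is not None:
        done.append(run_prefill(r))  # overlaps with the next load
    t.join()
    return done

reqs = [{"id": i, "load_s": 0.01, "compute_s": 0.01} for i in range(4)]
print(pipeline(reqs))  # requests complete in arrival order: [0, 1, 2, 3]
```

A real engine would replace the sleeps with RDMA/PCIe transfers and GPU kernels and use bounded queues for backpressure, but the stage separation is the same.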

📝 Abstract
Distributed prefix caching has become a core technique for efficient LLM serving. However, for long-context requests with high cache hit ratios, retrieving reusable KVCache blocks from remote servers has emerged as a new performance bottleneck. Such network-intensive LLM inference is expected to become increasingly common as agentic AI workloads continue to grow. Yet existing LLM inference engines remain largely compute-centric: they treat KVCache loading as a phase subordinate to GPU execution and often fail to account for its delay explicitly during scheduling. We present CALVO, an LLM serving engine that treats KVCache loading as a first-class concern. CALVO decouples KVCache loading and GPU computation into independently managed, asynchronously progressing stages, enabling better utilization of network, PCIe, and computation resources. In addition, CALVO incorporates KVCache loading delay as an explicit component of per-request service cost, leading to more accurate scheduling decisions. Experiments on a real testbed with diverse long-context workloads show that CALVO substantially improves the efficiency of network-intensive LLM inference, achieving up to 61.67% higher SLO attainment than the baseline.
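The abstract's service-cost-aware scheduling can be illustrated with a toy cost model in which each request's estimated cost includes its remote KVCache loading delay, not just GPU compute time. The formula, bandwidth and throughput numbers, and field names below are assumptions for illustration, not CALVO's actual model.

```python
def service_cost(req, net_bw_gbps=25.0, prefill_tput_tok_s=8000.0):
    # Assumed cost model: network time to fetch remotely cached KV
    # blocks, plus GPU prefill time for the cache-miss tokens.
    load_s = req["kv_bytes_remote"] / (net_bw_gbps * 1e9 / 8)
    compute_s = req["miss_tokens"] / prefill_tput_tok_s
    return load_s + compute_s

def schedule(requests):
    # Serve the request with the least SLO slack first, where
    # slack = deadline minus estimated total service cost.
    return sorted(requests, key=lambda r: r["slo_deadline_s"] - service_cost(r))

reqs = [
    # A: high cache hit ratio, so most of its cost is network loading.
    {"id": "A", "kv_bytes_remote": 2e9, "miss_tokens": 1000, "slo_deadline_s": 1.0},
    # B: no remote KV to fetch, cost is pure prefill compute.
    {"id": "B", "kv_bytes_remote": 0.0, "miss_tokens": 8000, "slo_deadline_s": 2.0},
]
print([r["id"] for r in schedule(reqs)])  # ['A', 'B']: A has less slack
```

A compute-only cost estimate would rank A as cheap (only 1000 miss tokens) and could let its 0.64 s network transfer silently consume its SLO budget; including load delay surfaces that risk at scheduling time.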
Problem

Research questions and friction points this paper is trying to address.

LLM inference
KVCache loading
network bottleneck
serving efficiency
long-context requests
Innovation

Methods, ideas, or system contributions that make the work stand out.

distributed prefix caching
KVCache loading
asynchronous scheduling
network-intensive LLM inference
SLO-aware serving