Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

📅 2026-06-04
🤖 AI Summary
This work addresses the communication bottleneck caused by KV cache transmission in disaggregated large language model (LLM) inference, and the semantic misalignment that arises when caches are reused across heterogeneous models. The authors propose Semantic Cache Distillation (SCD), a framework that replaces raw KV transmission with compact semantic codes. SCD reconstructs most layers from low-rank subspaces to reduce transfer cost, and predicts normalized inputs at sparse transition layers to truncate error propagation. In bandwidth-constrained settings, the method achieves up to a 2.65× speedup in time-to-first-token (TTFT) over the oracle consumer prefill while keeping generation quality within 5% F1 of the oracle, and it outperforms quantization and selective recomputation baselines on the quality-latency Pareto frontier.
📝 Abstract
Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token (TTFT). Moreover, reusing caches across heterogeneous models (e.g., base and fine-tuned variants) causes semantic misalignment that accumulates over layers, degrading generation quality. We propose Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV transmission with compact semantic codes. SCD addresses these challenges via two mechanisms: (1) Reuse, which reconstructs most layers from low-rank subspaces to minimize transfer cost, and (2) Patch, which predicts normalized inputs at sparse transition layers to truncate error propagation. Empirically, SCD delivers up to 2.65× TTFT speedup over the oracle consumer prefill and dominates quantization and selective recomputation baselines on the quality-latency Pareto frontier in bandwidth-constrained regimes, while keeping generation quality within 5% F1 of the oracle.
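To make the Reuse idea more concrete, the following is a minimal sketch of low-rank cache compression in general, not the authors' implementation. It assumes a per-layer basis fitted offline by truncated SVD and shared by sender and receiver, so that only the low-dimensional codes cross the network. The names `fit_basis`, `encode`, and `decode`, the synthetic data, and the chosen dimensions are all hypothetical and used for illustration only.

```python
import numpy as np

def fit_basis(calib: np.ndarray, rank: int) -> np.ndarray:
    """Fit an orthonormal (d, rank) basis from calibration rows via truncated SVD.

    Assumption: the basis is computed offline and already present on both sides.
    """
    _, _, vt = np.linalg.svd(calib, full_matrices=False)
    return vt[:rank].T

def encode(kv: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Sender side: project (tokens, d) cache rows to (tokens, rank) codes."""
    return kv @ basis

def decode(codes: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Receiver side: reconstruct approximate (tokens, d) cache rows from codes."""
    return codes @ basis.T

# Synthetic stand-in for one layer's key (or value) matrix:
# approximately rank-8 structure plus small noise. Real caches may differ.
rng = np.random.default_rng(0)
tokens, d, rank = 256, 64, 8
latent = rng.standard_normal((tokens, rank))
mixing = rng.standard_normal((rank, d))
kv = latent @ mixing + 0.01 * rng.standard_normal((tokens, d))

basis = fit_basis(kv, rank)
codes = encode(kv, basis)    # this is what would be transmitted
recon = decode(codes, basis)

rel_err = np.linalg.norm(kv - recon) / np.linalg.norm(kv)
compression = kv.size / codes.size  # ratio of elements sent, raw vs. coded

print(f"codes shape: {codes.shape}, compression: {compression:.1f}x, "
      f"relative error: {rel_err:.4f}")
```

In this toy setting the transmitted payload shrinks by the ratio `d / rank`, at the cost of a reconstruction error that depends on how well the cache is captured by a low-rank subspace. The sketch does not cover the Patch mechanism or cross-model misalignment, which the abstract describes only at a high level.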
Problem

Research questions and friction points this paper is trying to address.

disaggregated serving
KV cache
semantic misalignment
communication bottleneck
LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Cache Distillation
Disaggregated LLM Serving
KV Cache Compression
Low-Rank Reconstruction
Error Propagation Truncation
Qianli Ma
(1) Faculty of Arts and Science, Beijing Normal University, Zhuhai 519087, China; (2) Institute of Artificial Intelligence and Future Networks, Beijing Normal University, Zhuhai 519087, Guangdong, China
Zhiqing Tang
Associate Professor, Beijing Normal University
Edge Computing, Edge AI Systems, Container, Reinforcement Learning
Hanshuai Cui
(2) Institute of Artificial Intelligence and Future Networks, Beijing Normal University, Zhuhai 519087, Guangdong, China; (3) School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
Zhi Yao
(2) Institute of Artificial Intelligence and Future Networks, Beijing Normal University, Zhuhai 519087, Guangdong, China; (3) School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
Weijia Jia
FIEEE, Chair Professor, Beijing Normal University and UIC
Cyber Intelligent Computing, Networking