Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

📅 2026-06-04

🤖 AI Summary

This work addresses two problems in disaggregated large language model inference: the communication bottleneck caused by KV cache transmission, and the semantic misalignment that arises when caches are reused across heterogeneous models. The authors propose Semantic Cache Distillation (SCD), a framework that transmits compact semantic codes instead of raw KV caches. Most layers are reconstructed from low-rank subspaces to reduce transfer cost, while normalized inputs are predicted at sparse transition layers to truncate error propagation. In bandwidth-constrained settings, the method achieves up to a 2.65× speedup in time-to-first-token (TTFT) over the oracle consumer prefill while keeping generation quality within 5% F1 of the oracle, and it outperforms quantization and selective recomputation baselines on the quality-latency trade-off.

📝 Abstract

Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token (TTFT). Moreover, reusing caches across heterogeneous models (e.g., base and fine-tuned variants) causes semantic misalignment that accumulates over layers, degrading generation quality. We propose Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV transmission with compact semantic codes. SCD addresses these challenges via two mechanisms: (1) Reuse, which reconstructs most layers from low-rank subspaces to minimize transfer cost, and (2) Patch, which predicts normalized inputs at sparse transition layers to truncate error propagation. Empirically, SCD delivers up to 2.65× TTFT speedup over the oracle consumer prefill and dominates quantization and selective recomputation baselines on the quality-latency Pareto frontier in bandwidth-constrained regimes, while keeping generation quality within 5% F1 of the oracle.
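
To make the communication bottleneck concrete, the following back-of-the-envelope sketch compares the bytes needed to ship a full KV cache with the bytes needed to ship low-rank codes. It is not the authors' implementation. The model dimensions, the rank, the link bandwidth, and the function names are all hypothetical, and the sketch assumes the projection bases already reside on the consumer, so only per-token codes cross the network.

```python
def kv_cache_bytes(num_layers, num_tokens, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes in a full KV cache: keys and values for every layer and token."""
    values_per_token_per_layer = 2 * num_kv_heads * head_dim  # 2 = keys + values
    return num_layers * num_tokens * values_per_token_per_layer * bytes_per_value


def low_rank_code_bytes(num_layers, num_tokens, rank, bytes_per_value=2):
    """Bytes in rank-`rank` codes for keys and values.

    Assumption: the projection bases are shared ahead of time, so they are
    not counted in the per-request transfer.
    """
    return num_layers * num_tokens * 2 * rank * bytes_per_value


def transfer_seconds(num_bytes, bandwidth_gbit_per_s):
    """Idealized transfer time over a link, ignoring latency and protocol overhead."""
    return num_bytes * 8 / (bandwidth_gbit_per_s * 1e9)


# Hypothetical configuration (not taken from the paper):
# 32 layers, 8192 prompt tokens, 8 KV heads of dimension 128, fp16 values.
full = kv_cache_bytes(num_layers=32, num_tokens=8192, num_kv_heads=8, head_dim=128)
codes = low_rank_code_bytes(num_layers=32, num_tokens=8192, rank=128)

print(f"full KV cache : {full / 2**20:.0f} MiB, "
      f"{transfer_seconds(full, 10):.3f} s at 10 Gbit/s")
print(f"low-rank codes: {codes / 2**20:.0f} MiB, "
      f"{transfer_seconds(codes, 10):.3f} s at 10 Gbit/s")
print(f"reduction     : {full / codes:.1f}x")
```

Under these assumed numbers the full cache is 1 GiB and the codes are 8 times smaller. The real saving depends on the rank that the quality constraint allows and on how many layers need patching instead of reuse, neither of which this sketch models.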

Problem

Research questions and friction points this paper is trying to address.

disaggregated serving

KV cache

semantic misalignment

communication bottleneck

LLM inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Cache Distillation

Disaggregated LLM Serving

KV Cache Compression