Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

📅 2026-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lossless KV cache management approaches overlook the computational efficiency of GPU attention kernels, resulting in high inference latency. This work proposes AsymCache, a system that, for the first time, incorporates GPU computation latency into cache eviction decisions. By combining Multi-Segment Attention (MSA), a position-aware recomputation cost model, and adaptive chunked scheduling, AsymCache jointly optimizes cache hit rate and computational efficiency. Compared to state-of-the-art baselines, AsymCache reduces time-to-first-token (TTFT) by 1.90–2.03× and time-per-output-token (TPOT) by 1.62–1.71×, and it reduces average job latency by up to 18.1% when integrated into the Continuum agent serving system.
📝 Abstract
Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without considering how different KV cache blocks affect the execution efficiency of GPU attention kernels. In this paper, we propose AsymCache, a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance. It comprises three key components: Multi-Segment Attention (MSA) for efficient processing of non-contiguous KV context, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization. Experiments show that AsymCache reduces time-to-first-token (TTFT) by 1.90-2.03x and time-per-output-token (TPOT) by 1.62-1.71x over state-of-the-art baselines, confirming the effectiveness of the method on common workloads and validating its design goal of balancing computational efficiency with cache hit rate. Moreover, the low-level design of AsymCache allows seamless integration into agent serving systems such as Continuum, where it further reduces average job latency by up to 18.1%.
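The abstract does not describe how MSA is implemented, so the following is only an illustrative sketch of one standard way to compute exact attention over non-contiguous KV segments: each segment yields a partial softmax result, and the partials are merged with a log-sum-exp rescaling. The function names (`segment_attention`, `merge_segments`, `multi_segment_attention`) and the single-query, pure-Python layout are assumptions made for illustration, not the paper's kernel.

```python
import math

def segment_attention(q, keys, values):
    """Partial attention of one query over one KV segment (hypothetical helper).

    Returns (unnormalized weighted value sum, softmax denominator, max score),
    all computed relative to this segment's own max score for stability.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return acc, denom, m

def merge_segments(partials):
    """Merge per-segment partials into the exact softmax-attention output."""
    m_all = max(m for _, _, m in partials)
    dim = len(partials[0][0])
    acc_all = [0.0] * dim
    denom_all = 0.0
    for acc, denom, m in partials:
        c = math.exp(m - m_all)  # rescale this segment to the global max
        denom_all += c * denom
        for d in range(dim):
            acc_all[d] += c * acc[d]
    return [a / denom_all for a in acc_all]

def multi_segment_attention(q, segments):
    """segments: list of (keys, values) pairs, one per KV segment."""
    return merge_segments([segment_attention(q, k, v) for k, v in segments])
```

Because the merge is mathematically exact, calling `multi_segment_attention` on several segments gives the same result (up to floating-point error) as calling it on a single segment holding the concatenated keys and values, which is the property a lossless scheme needs.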
Problem

Research questions and friction points this paper is trying to address.

KV-cache management
Large Language Model inference
GPU attention kernel efficiency
lossless caching
computation latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Segment Attention
KV-Cache Management
Lossless Caching
GPU Attention Kernel
Adaptive Chunking