SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving

📅 2026-06-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the inefficiency of conventional KV caching in reusing non-prefix, cross-request, cross-turn, and cross-agent repetitive content, which incurs substantial computational overhead during the prefill phase of long-context large language models. To overcome this, the authors propose SparseX, a segment-level KV cache sharing method that treats contiguous token segments as reusable units and recovers complex interleaved context interactions via sparse recomputation within a single forward pass. Its key innovations include a hybrid attention mechanism leveraging Sparse-Q indexing and layer-adaptive thresholds, enabling effective support for multi-turn dialogues, retrieval-augmented generation (RAG), and agent workflows. Integrated seamlessly into vLLM through segmented cache lookup, PagedAttention, RoPE alignment, and FlashAttention, SparseX provides a model-agnostic, training-free, and Prefix Cache–compatible unified execution path. Experiments demonstrate that SparseX significantly reduces first-token latency and computational cost while improving cache efficiency and generation quality.
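As a rough illustration of what segment-level lookup means in contrast to prefix matching, the sketch below hashes fixed-length token segments by content alone, so a segment can be found even when it appears at a different offset in a later request. The fixed segment length, the SHA-256 key, and the `SegmentCache` class are assumptions made for this example; the summary does not specify how SparseX delimits or indexes segments.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

SEGMENT_LEN = 4  # hypothetical segment size, chosen only for the demo


def segment_key(segment: Tuple[int, ...]) -> str:
    """Content-only key: the same tokens hash identically at any position."""
    return hashlib.sha256(",".join(map(str, segment)).encode()).hexdigest()


def split_segments(tokens: List[int]) -> List[Tuple[int, Tuple[int, ...]]]:
    """Split into full, aligned segments; a trailing partial segment is skipped."""
    return [
        (start, tuple(tokens[start:start + SEGMENT_LEN]))
        for start in range(0, len(tokens) - SEGMENT_LEN + 1, SEGMENT_LEN)
    ]


class SegmentCache:
    """Hypothetical index from segment content to the position it was cached at."""

    def __init__(self) -> None:
        self._store: Dict[str, int] = {}

    def register(self, tokens: List[int]) -> None:
        for start, segment in split_segments(tokens):
            self._store.setdefault(segment_key(segment), start)

    def lookup(self, tokens: List[int]) -> List[Tuple[int, Optional[int]]]:
        """Return (new_start, cached_start) per segment; cached_start is None on a miss."""
        return [
            (start, self._store.get(segment_key(segment)))
            for start, segment in split_segments(tokens)
        ]


cache = SegmentCache()
cache.register([1, 2, 3, 4, 10, 11, 12, 13])       # system prompt + document
hits = cache.lookup([10, 11, 12, 13, 7, 8, 9, 6])  # same document, now at the front
print(hits)
```

In this toy run the document segment is found although the two requests share no prefix. Because the cached and new positions differ, the stored keys would still need positional correction, which is the role the summary assigns to RoPE alignment.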
📝 Abstract
In long-context LLM serving, the prefill stage often dominates time-to-first-token and computational cost. Although Prefix Cache in vLLM/PagedAttention has been widely used to reuse identical prompt prefixes, repeated content in practical applications frequently appears as non-prefix, cross-request, cross-turn, and cross-agent segments, which makes conventional cache mechanisms insufficient. This paper presents SparseX, a segment-level KV Cache sharing method for common serving scenarios. SparseX uses contiguous token segments as reuse units and exploits Sparse-Q indices that naturally arise in KV Cache reuse workloads to estimate the key tokens that require correction. Based on this estimate, SparseX performs Sparse-KV Recomputation within a single forward pass, thereby restoring cross-segment contextual interactions under complex interleaved reuse patterns while avoiding additional models or separate preprocessing stages for token selection. SparseX further implements a full+sparse hybrid attention mode based on a layer-specific threshold: early layers retain full attention to obtain a more stable token-importance signal, and later layers switch to sparse recomputation to improve reuse quality on complex long-context tasks. We implement SparseX-vLLM on top of vLLM, integrating segment-level cache lookup, PagedAttention management, RoPE alignment, Sparse-Q token selection, and FlashAttention backends into a unified execution path. SparseX is model-agnostic, training-free, and compatible with Prefix Cache, and it provides unified support for common online serving scenarios including multi-round chat, retrieval-augmented generation (RAG), and agent workflows.
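The two mechanisms described above, a layer-specific switch between full and sparse attention and the use of the non-cached (Sparse-Q) queries to rank cached tokens for recomputation, can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions: the softmax-mass scoring rule, the fixed `budget`, and the function names `attention_mode` and `select_tokens_to_recompute` are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def attention_mode(layer_idx: int, threshold_layer: int) -> str:
    """Early layers run full attention; layers at or past the threshold run sparse."""
    return "full" if layer_idx < threshold_layer else "sparse"


def select_tokens_to_recompute(
    queries: List[List[float]],   # query vectors of the Sparse-Q (non-cached) tokens
    keys: List[List[float]],      # key vectors of all tokens in the context
    cached_positions: List[int],  # positions whose KV entries came from the cache
    budget: int,                  # assumed cap on how many cached tokens to recompute
) -> List[int]:
    """Rank cached tokens by the attention mass they receive from the sparse queries."""
    importance = [0.0] * len(keys)
    scale = 1.0 / math.sqrt(len(keys[0]))
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        for j, e in enumerate(exps):
            importance[j] += e / total
    ranked = sorted(cached_positions, key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:budget])


queries = [[1.0, 0.0]]
keys = [[0.0, 1.0], [5.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
selected = select_tokens_to_recompute(queries, keys, cached_positions=[0, 1, 2], budget=2)
modes = [attention_mode(i, threshold_layer=2) for i in range(4)]
print(selected, modes)
```

Here position 3 is treated as a freshly computed token and is never a recompute candidate; only cached positions compete for the budget. A real implementation would operate on batched tensors inside the attention kernel rather than Python lists.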
Problem

Research questions and friction points this paper is trying to address.

KV Cache sharing
long-context LLM serving
non-prefix reuse
interleaved serving
segment-level caching
Innovation

Methods, ideas, or system contributions that make the work stand out.

segment-level KV cache sharing
sparse recomputation
interleaved LLM serving
hybrid attention
token importance estimation
🔎 Similar Papers
2024-05-26 · Proceedings of the Twentieth European Conference on Computer Systems · Citations: 7