You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

📅 2026-06-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the inefficiency of large language models in long-context decoding, where existing sparse attention mechanisms struggle to balance computational efficiency against output quality. The paper proposes Cross-Layer Sparse Attention (CLSA), a mechanism built on KV-sharing architectures such as YOCO. CLSA shares the routing index across layers in addition to the KV cache: a single indexer performs token-level top-k selection once, and the cross-decoder layers reuse the result, which amortizes routing overhead while preserving the fine-grained selectivity of token sparse attention. By jointly improving three inference bottlenecks (prefill computation, KV-cache storage, and long-context decoding), CLSA achieves up to 7.6× faster decoding and 17.1× higher overall throughput at 128K context while remaining accurate on both short- and long-context benchmarks.
📝 Abstract
Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
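The shared-routing idea in the abstract can be illustrated with a small NumPy sketch of one decoding step. This is not the paper's implementation: the single-head layout, the dot-product indexer, the residual update, and every name here (`route_once`, `sparse_attention`, `decode_step`, `w_index`, `layer_wq`) are assumptions made for illustration. It shows only the structural point that top-k selection over the cache runs once, and each layer then attends to the same selected rows of the shared KV cache.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()


def route_once(index_query, index_keys, k):
    """Score every cached token once and return the sorted top-k positions.

    Hypothetical indexer: a plain dot product between one query vector and
    per-token index keys. The paper's indexer may differ.
    """
    scores = index_keys @ index_query              # shape (n_tokens,)
    k = min(k, scores.shape[0])
    top = np.argpartition(scores, -k)[-k:]         # unordered top-k positions
    return np.sort(top)


def sparse_attention(q, K, V, idx):
    """Single-head attention restricted to the selected cache rows."""
    Ks, Vs = K[idx], V[idx]
    weights = softmax(Ks @ q / np.sqrt(q.shape[0]))
    return weights @ Vs


def decode_step(h, K, V, index_keys, w_index, layer_wq, k):
    """One decoding step over layers that share both the KV cache and the index."""
    idx = route_once(h @ w_index, index_keys, k)   # the only top-k in this step
    for wq in layer_wq:                            # each layer reuses idx
        h = h + sparse_attention(h @ wq, K, V, idx)
    return h, idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, d, d_index, n_layers = 1024, 16, 4, 4
    K = rng.normal(size=(n_tokens, d))
    V = rng.normal(size=(n_tokens, d))
    index_keys = rng.normal(size=(n_tokens, d_index))
    w_index = rng.normal(size=(d, d_index))
    layer_wq = [rng.normal(size=(d, d)) for _ in range(n_layers)]
    h = rng.normal(size=d)

    out, idx = decode_step(h, K, V, index_keys, w_index, layer_wq, k=64)
    print(out.shape, idx.shape)
```

Under these assumptions the routing cost per step is one pass over the cache regardless of how many layers reuse the index, whereas per-layer token routing would repeat that pass in every layer. This is the amortization the abstract describes.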
Problem

Research questions and friction points this paper is trying to address.

long-context inference
sparse attention
decoding efficiency
KV cache
routing overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-layer sparse attention
shared routing
KV cache sharing
long-context LLMs
efficient inference