SALS: Sparse Attention in Latent Space for KV cache Compression

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-context inference in large language models (LLMs) suffers from severe efficiency bottlenecks due to excessive memory footprint and bandwidth pressure induced by Key-Value (KV) cache storage. Method: This paper proposes a latent-space sparse attention compression framework. We observe that Rotary Position Embedding (RoPE) significantly increases the rank of key vectors, undermining conventional low-rank KV compression. To address this, we introduce a learnable low-rank projection that maps KV states into a compact latent space, where RoPE constraints are removed and query-key matching is performed directly for efficient token importance estimation and sparse KV recovery—avoiding full cache reconstruction. Contribution/Results: Our method achieves 6.4× KV cache compression and 5.7× attention speedup on LLaMA and Mistral. End-to-end throughput improves by 1.4× (4K context) and 4.5× (32K context), setting new state-of-the-art performance.
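The summary's key observation is that RoPE's position-dependent rotations destroy the low-rank structure of the key matrix. A minimal NumPy sketch of that effect (illustrative sizes only, not the paper's setup): build a key matrix of exact rank 8, apply standard RoPE rotations, and watch the numerical rank jump.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, r = 256, 64, 8  # sequence length, head dim, true rank (illustrative sizes)

# Low-rank key matrix: each row is a key vector; rank r by construction
K = rng.standard_normal((T, r)) @ rng.standard_normal((r, d))

# Standard RoPE: rotate each (2i, 2i+1) pair of every key by a position-dependent angle
theta = 10000.0 ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
ang = np.arange(T)[:, None] * theta[None, :]  # (T, d/2) angles
cos, sin = np.cos(ang), np.sin(ang)
K_even, K_odd = K[:, 0::2], K[:, 1::2]
K_rope = np.empty_like(K)
K_rope[:, 0::2] = K_even * cos - K_odd * sin
K_rope[:, 1::2] = K_even * sin + K_odd * cos

def numerical_rank(M, tol=1e-6):
    """Count singular values above a relative threshold."""
    s = np.linalg.svd(M, compute_uv=False)
    return int((s > tol * s[0]).sum())

print(numerical_rank(K), numerical_rank(K_rope))  # rank grows after RoPE
```

Because each position applies a different rotation, the rotated rows no longer lie in a single low-dimensional subspace, which is why naive low-rank compression of post-RoPE keys degrades accuracy.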

📝 Abstract
Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value cache size and high memory bandwidth requirements. Previous research has demonstrated that the KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, due to the widely adopted Rotary Position Embedding mechanism in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, the application of RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space (SALS) framework. SALS projects the KV cache into a compact latent space via low-rank projection and performs sparse token selection using RoPE-free query-key interactions in this space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models, LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance while maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and a 5.7-fold speed-up in the attention operator compared to FlashAttention2 on 4K sequences. For end-to-end throughput, it achieves 1.4-fold and 4.5-fold improvements compared to GPT-fast on 4K and 32K sequences, respectively.
Problem

Research questions and friction points this paper is trying to address.

Compressing KV cache to reduce memory bandwidth in large language models
Overcoming accuracy loss from Rotary Position Embedding in compression
Enabling efficient long-context inference with sparse attention mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projects KV cache into compact latent space
Performs sparse token selection without RoPE
Reconstructs only important tokens to avoid overhead
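The three innovations above can be sketched as one decode step in NumPy. This is a conceptual sketch only, not the paper's implementation: the down-projections here are random stand-ins for SALS's learned low-rank projections, the up-projections are pseudo-inverses, and the function name `sals_step` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_lat, k = 1024, 64, 16, 64  # cache length, head dim, latent dim, tokens kept

# Hypothetical learned down-projections (random stand-ins here)
P_k = rng.standard_normal((d, d_lat)) / np.sqrt(d)
P_v = rng.standard_normal((d, d_lat)) / np.sqrt(d)

K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Cache only the compact latent states instead of the full K/V
K_lat, V_lat = K @ P_k, V @ P_v  # (T, d_lat) each

# Stand-in up-projections for recovering selected tokens
W_up_k = np.linalg.pinv(P_k)
W_up_v = np.linalg.pinv(P_v)

def sals_step(q):
    """One decode step: score tokens in latent space, recover only the top-k."""
    q_lat = q @ P_k                         # RoPE-free latent query
    scores = K_lat @ q_lat                  # cheap token-importance estimate
    top = np.argpartition(scores, -k)[-k:]  # indices of important tokens
    K_sel = K_lat[top] @ W_up_k             # reconstruct only selected keys
    V_sel = V_lat[top] @ W_up_v             # reconstruct only selected values
    att = np.exp(K_sel @ q / np.sqrt(d))    # sparse attention over k tokens
    att /= att.sum()
    return att @ V_sel

out = sals_step(rng.standard_normal(d))
print(out.shape)  # (64,)
```

The point of the design: scoring happens entirely in the `d_lat`-dimensional latent cache, so the expensive reconstruction to full head dimension is paid for only `k` tokens per step rather than the whole cache.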
Junlin Mu
Beijing Jiaotong University, ByteDance Seed
Hantao Huang
ByteDance Seed
Jihang Zhang
ByteDance Seed
Minghui Yu
ByteDance Seed
Tao Wang
Beijing Jiaotong University
Yidong Li
Beijing Jiaotong University
privacy preserving, data mining, social network analysis, multimedia computing