Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving

📅 2025-03-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the excessive KV cache memory overhead in long-context LLM inference and the trade-off between accuracy and efficiency in existing dynamic sparse attention (DSA) methods, this paper proposes Progressive Sparse Attention (PSA). PSA introduces two key innovations: (1) a dynamic, hierarchical, token-level KV budget allocation strategy grounded in empirical attention weight distributions, departing from rigid top-k sparsification; and (2) a pipelined iterative computation framework with unified GPU memory management, integrating adaptive KV pruning, CPU-GPU cooperative scheduling, and hierarchical heterogeneous memory optimization. Experiments demonstrate that PSA reduces KV cache usage for attention computation by up to 2.4× over state-of-the-art sparse baselines and up to 8.8× over dense baselines, while improving end-to-end serving throughput by up to 1.4× and 2.0×, respectively—effectively reconciling high accuracy with high efficiency.
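The core algorithmic idea—sizing each token's KV budget from its actual attention weight distribution instead of a fixed top-$k$—can be sketched as a cumulative-mass cutoff. This is an illustrative interpretation, not PSA's actual implementation; the function name and the `mass_threshold` parameter are assumptions for the sketch:

```python
import numpy as np

def adaptive_kv_select(attn_weights, mass_threshold=0.95):
    """Select the smallest set of KV entries whose cumulative attention
    mass reaches `mass_threshold`, instead of a fixed top-k budget.

    attn_weights: 1-D array of softmax attention weights (sums to 1).
    Returns the indices of the selected KV entries, heaviest first.
    """
    order = np.argsort(attn_weights)[::-1]            # sort heaviest first
    cum = np.cumsum(attn_weights[order])              # running attention mass
    budget = int(np.searchsorted(cum, mass_threshold) + 1)
    return order[:budget]

# A peaked distribution needs only a few KV entries to cover the mass,
# while a flat one needs many—so the budget adapts per token and layer
# rather than being a single fixed k.
peaked = np.array([0.90, 0.05, 0.03, 0.01, 0.01])
flat = np.full(5, 0.2)
```

Under this view, a fixed $k$ either over-provisions the peaked case (wasting memory) or under-provisions the flat case (losing accuracy), which is the trade-off the adaptive budget avoids.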

📝 Abstract
Processing long contexts has become a critical capability for modern large language models (LLMs). However, serving long-context LLMs comes with significant inference costs due to the high memory overhead of the key-value (KV) cache. Existing work leverages dynamic sparse attention algorithms (DSAes) to mitigate the KV cache overhead, but these algorithms rely on top-$k$ KV cache selection, which results in a trade-off between accuracy and efficiency. A larger $k$ improves accuracy but decreases efficiency, while a smaller $k$ boosts efficiency but compromises accuracy. To overcome this trade-off, this paper presents PSA, a $\underline{P}$rogressive $\underline{S}$parse $\underline{A}$ttention mechanism that integrates algorithmic innovations with system co-design to achieve both high inference accuracy and improved efficiency in LLM serving. The PSA algorithm adaptively adjusts the KV cache budget of different tokens and layers according to their real attention weight distributions, rather than relying on a fixed budget $k$. This enables high accuracy while minimizing KV cache usage. To further enhance execution efficiency, we introduce a pipelined iteration scheme that reduces CPU-GPU interleaving and synchronization overhead during PSA computation. Additionally, we implement unified GPU memory management that optimizes PSA's memory utilization by accounting for uneven memory requirements across different model layers. Extensive experimental results demonstrate that PSA reduces KV cache usage for attention computation by up to 2.4$\times$ and 8.8$\times$, and increases end-to-end serving throughput by up to 1.4$\times$ and 2.0$\times$, compared to state-of-the-art DSAes and systems without sparse attention, respectively.
Problem

Research questions and friction points this paper is trying to address.

Reduces KV cache overhead in long-context LLM serving.
Balances accuracy and efficiency in sparse attention mechanisms.
Improves GPU memory management and execution efficiency.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Sparse Attention adapts KV cache budget dynamically.
Pipelined iteration reduces CPU-GPU synchronization overhead.
Unified GPU memory management optimizes memory utilization.
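The pipelined-iteration contribution can be illustrated with a generic double-buffered loop: CPU-side KV selection for step $i+1$ overlaps with the attention computation for step $i$, so neither side stalls waiting for the other. This is a schematic plain-Python analogy, not PSA's CUDA/serving implementation; `select_kv` and `attend` are placeholder stage names:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(steps, select_kv, attend):
    """Double-buffered pipeline: while the compute stage (`attend`)
    processes step i, a worker thread runs the selection stage
    (`select_kv`) for step i+1, hiding its latency."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = cpu.submit(select_kv, steps[0])    # prefetch for step 0
        for i in range(len(steps)):
            selected = pending.result()              # wait for selection stage
            if i + 1 < len(steps):                   # launch next selection early
                pending = cpu.submit(select_kv, steps[i + 1])
            results.append(attend(selected))         # compute stage (the "GPU" side)
    return results
```

A strictly sequential loop would serialize selection and computation at every step; overlapping the two stages is what the paper's pipelined scheme exploits to cut CPU-GPU synchronization overhead.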