SpecAttn: Speculating Sparse Attention

📅 2025-10-31

🤖 AI Summary

To address the quadratic cost of self-attention in large language models (LLMs) as context length grows, this paper proposes a training-free sparse attention method that is compatible with existing pre-trained architectures. Its central idea is to reuse the attention weights that the draft model already computes during speculative decoding to select the key tokens the target model attends to. The approach combines three techniques: KL-divergence-based layer alignment between the draft and target models, a GPU-efficient, sort-free top-p token selection algorithm, and dynamic KV cache pruning guided by the predicted attention pattern. Evaluated on PG-19, it reduces KV cache accesses by 75.3% at the cost of a 15.29% perplexity increase, which the paper reports as substantially better than mainstream sparse attention baselines. The core contribution is to optimize approximate verification and computational compression together, balancing inference efficiency, accuracy, and deployment practicality.
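As a rough illustration of what pruned KV cache access means in practice, the sketch below computes single-query attention over only a chosen subset of cache positions. It is a minimal standard-library example, not the paper's GPU implementation; the names `sparse_attention`, `access_reduction`, and `keep` are hypothetical, and `keep` stands in for whatever positions the draft model's attention would nominate.

```python
import math

def sparse_attention(query, keys, values, keep):
    """Single-query softmax attention restricted to the cache positions in `keep`.

    Hypothetical helper: `keep` stands in for the positions nominated by the
    draft model; every other key/value row is never read.
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product scores, computed only for the kept positions.
    scores = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in keep]
    # Numerically stable softmax over the kept scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # Weighted sum of the kept value rows.
    out = [0.0] * len(values[0])
    for w, i in zip(weights, keep):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out

def access_reduction(context_len, kept):
    """Fraction of KV cache rows that are skipped when only `kept` rows are read."""
    return 1.0 - kept / context_len
```

For scale, reading 247 rows out of a 1,000-token context corresponds to `access_reduction(1000, 247)`, roughly the 75.3% reduction quoted above.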

📝 Abstract

Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
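The abstract does not spell out the sorting-free top-p algorithm, so the following is only one plausible way to select a top-p set without sorting: bisect on a probability threshold and keep every token at or above it. Each bisection step needs only a masked sum, which maps naturally onto a parallel reduction. The function name `top_p_select`, the iteration count, and the bisection strategy itself are assumptions for illustration, not the paper's kernel.

```python
def top_p_select(probs, top_p=0.9, iters=30):
    """Return indices of the highest-probability tokens whose mass reaches `top_p`.

    Sort-free sketch (an assumption, not the paper's kernel): bisect on a
    threshold t and keep every token with probability >= t. Tokens with
    equal probability are kept or dropped together.
    """
    lo, hi = 0.0, max(probs)
    for _ in range(iters):
        mid = (lo + hi) / 2
        # Mass of all tokens at or above the candidate threshold.
        mass = sum(p for p in probs if p >= mid)
        if mass >= top_p:
            lo = mid  # threshold can still be raised
        else:
            hi = mid  # threshold is too strict
    return [i for i, p in enumerate(probs) if p >= lo]

# Draft attention weights over five cached tokens (made-up numbers).
draft_attention = [0.5, 0.3, 0.1, 0.05, 0.05]
selected = top_p_select(draft_attention, top_p=0.75)
```

With these made-up weights, the two heaviest tokens already cover more than 75% of the mass, so only their indices are selected and the remaining three cache entries would not need to be read.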

Problem

Research questions and friction points this paper is trying to address.

Reduces computational bottlenecks in LLM inference

Enables efficient sparse attention in pre-trained transformers

Maintains output quality while eliminating redundant computation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free sparse attention integration with speculative decoding

KL divergence-based layer alignment between draft and target models

Dynamic key-value cache pruning using draft attention patterns
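The layer alignment idea in the list above can be sketched as follows: for each target layer, pick the draft layer whose attention distribution is closest under KL divergence. This is a minimal sketch under stated assumptions. The direction KL(target || draft), the use of a single attention row per layer, and the names `kl_divergence` and `align_layers` are illustrative choices rather than details confirmed by the source; a real calibration would presumably average over many prompts and heads.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def align_layers(target_attn, draft_attn):
    """Map each target layer index to the draft layer index with the lowest KL.

    `target_attn` and `draft_attn` are lists of attention distributions,
    one per layer (hypothetical calibration data).
    """
    mapping = {}
    for t, p in enumerate(target_attn):
        mapping[t] = min(range(len(draft_attn)),
                         key=lambda d: kl_divergence(p, draft_attn[d]))
    return mapping

# Made-up attention rows: two target layers, two draft layers.
target_layers = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
draft_layers = [[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]]
layer_map = align_layers(target_layers, draft_layers)
```

Here target layer 0 pairs with draft layer 1 and target layer 1 with draft layer 0, since those pairs have the most similar attention shapes. Such a mapping would be computed once offline, which fits the training-free framing.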

🔎 Similar Papers

How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse