SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To balance accuracy, efficiency, and plug-and-play compatibility in sparse attention for long-sequence autoregressive decoding, this paper proposes SeerAttention-R, a lightweight sparse attention framework. Methodologically, it (1) adapts sparse attention to autoregressive decoding by removing query pooling while retaining a learnable, self-distilled gating mechanism that dynamically generates high-quality sparse patterns; (2) provides TileLang-optimized sparse kernels behind a FlashAttention-3–compatible interface, enabling H100 GPU–specific acceleration; and (3) recovers near-lossless accuracy with only 0.4B tokens of fine-tuning. On the AIME benchmark, SeerAttention-R maintains near-lossless reasoning accuracy with a 4K token budget and achieves up to 9× faster decoding than FlashAttention-3 at 90% sparsity. The implementation is open-sourced.

📝 Abstract
We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on H100 GPUs at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
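The core idea of the abstract can be sketched in a few lines: at each decoding step, the KV cache is partitioned into fixed-size blocks, a gate scores each block against the current query, and full attention is computed only over the top-scoring blocks. The sketch below is illustrative only, assuming mean-pooled keys as a stand-in for the paper's learned self-distilled gate; function and parameter names are hypothetical, not from the SeerAttention codebase.

```python
import numpy as np

def block_sparse_decode_attention(q, K, V, block_size=64, keep_ratio=0.1):
    """One decoding step of block-sparse attention (illustrative sketch).

    q: (d,) query of the token being decoded
    K, V: (T, d) key/value cache
    keep_ratio: fraction of key blocks to attend to (e.g. 0.1 ~= 90% sparsity)
    """
    T, d = K.shape
    n_blocks = (T + block_size - 1) // block_size

    # Score each block of keys. Here we use a simple heuristic gate:
    # dot product of the query with the mean-pooled keys of the block.
    # The paper learns this gate via self-distillation instead.
    block_scores = np.empty(n_blocks)
    for b in range(n_blocks):
        pooled_k = K[b * block_size:(b + 1) * block_size].mean(axis=0)
        block_scores[b] = q @ pooled_k

    # Keep only the top-scoring blocks; the rest of the cache is skipped.
    k_keep = max(1, int(np.ceil(keep_ratio * n_blocks)))
    keep = np.sort(np.argsort(block_scores)[-k_keep:])
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, T)) for b in keep
    ])

    # Dense attention restricted to the selected blocks.
    logits = (K[idx] @ q) / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx]
```

With `keep_ratio=1.0` every block is kept and the result matches dense attention exactly; at `keep_ratio=0.1` roughly 90% of the KV cache is never read, which is the memory-bandwidth saving that the TileLang kernel exploits on H100.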
Problem

Research questions and friction points this paper is trying to address.

Enhancing long reasoning model decoding with sparse attention
Maintaining accuracy in auto-regressive decoding with minimal tokens
Achieving high-speed sparse decoding on GPUs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse attention for long reasoning models
Lightweight plug-in gating mechanism
Optimized sparse decoding kernel