Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

📅 2024-05-17
🏛️ arXiv.org
📈 Citations: 4
Influential: 0

🤖 AI Summary
To address the high latency and memory bottlenecks of attention computation during Transformer decoding with long contexts, this paper proposes a hardware-aware, scalable attention mechanism. Methodologically, it exploits the associative property of online softmax to treat attention as a parallel reduction and introduces a stream-K–style blocked reduction execution flow; it further couples key-value cache co-scheduling with hardware-adapted kernel fusion. The approach improves computational and memory-access efficiency during decoding: it achieves an average 2.6× attention speedup over FlashAttention-2 and up to 8.33× speedup at 512K context lengths. This reduces latency for long-text generation and provides an efficient, scalable low-level foundation for real-time inference with large language models.
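
To make the stream-K–style execution flow concrete, the following is a minimal Python sketch of the load-balancing idea only, not the paper's kernel. It assumes the decode workload can be flattened into equally sized key-value tiles per head and hands each worker a contiguous, near-equal range of tiles, which may cross head boundaries. The function name `stream_k_partition` and all of its parameters are hypothetical.

```python
def stream_k_partition(num_heads, context_len, tile_size, num_workers):
    """Split all (head, tile) work items into contiguous, near-equal ranges.

    Returns one list per worker; each entry is (head, first_tile, last_tile)
    with last_tile exclusive. A worker's range may span several heads.
    """
    tiles_per_head = -(-context_len // tile_size)  # ceiling division
    total_tiles = num_heads * tiles_per_head

    assignments = []
    for worker in range(num_workers):
        # Integer arithmetic keeps per-worker loads within one tile of each other.
        start = worker * total_tiles // num_workers
        end = (worker + 1) * total_tiles // num_workers

        segments = []
        tile = start
        while tile < end:
            head = tile // tiles_per_head
            first = tile % tiles_per_head
            # Stop at the end of this head or at the end of the worker's range.
            last = min(tiles_per_head, first + (end - tile))
            segments.append((head, first, last))
            tile += last - first
        assignments.append(segments)
    return assignments


if __name__ == "__main__":
    for worker, segments in enumerate(
        stream_k_partition(num_heads=2, context_len=10, tile_size=2, num_workers=3)
    ):
        print(worker, segments)
```

When a head's tiles are split across workers, each worker holds only a partial result for that head, so the partial results must later be combined by a reduction step.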

📝 Abstract
Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of state-of-the-art models has increased steadily, reaching billions of parameters. These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in the total context length, i.e., prompt and output tokens. Thus, several optimizations, such as key-value tensor caching and FlashAttention computation, have been proposed to meet the low-latency demands of applications relying on such large models. However, these techniques do not cater to the computationally distinct nature of the different phases of inference. To that end, we propose LeanAttention, a scalable technique for computing self-attention in the token-generation phase (decode phase) of decoder-only transformer models. LeanAttention enables scaling the attention mechanism implementation to the challenging case of long context lengths by redesigning the execution flow for the decode phase. We identify that the associative property of online softmax can be treated as a reduction operation, thus allowing us to parallelize the attention computation over these large context lengths. We extend the "stream-K" style reduction of tiled calculation to self-attention to enable parallel computation, resulting in an average 2.6x attention execution speedup over FlashAttention-2 and up to 8.33x speedup for 512k context lengths.
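
The reduction view of online softmax mentioned in the abstract can be illustrated with a small pure-Python sketch. It is an illustration under simplifying assumptions (a single query vector, no 1/sqrt(d) scaling, plain lists instead of GPU tiles), not the paper's implementation, and the helper names `partial_attention`, `merge`, `finalize`, and `blocked_attention` are hypothetical. Each block of the context yields a triple (running max, sum of exponentials, un-normalized output), and two triples combine with an associative `merge`, so blocks can be processed in parallel and reduced in any grouping.

```python
import math


def partial_attention(q, keys, values):
    """Un-normalized attention of one query over one block of keys/values.

    Returns (m, l, o): the max score, the sum of exp(score - m), and the
    exp-weighted sum of value vectors.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, o


def merge(a, b):
    """Combine two partial results; rescale both to the shared maximum."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    scale_a = math.exp(m_a - m)
    scale_b = math.exp(m_b - m)
    l = l_a * scale_a + l_b * scale_b
    o = [x * scale_a + y * scale_b for x, y in zip(o_a, o_b)]
    return m, l, o


def finalize(state):
    """Apply the softmax normalization once, after all blocks are merged."""
    _, l, o = state
    return [x / l for x in o]


def blocked_attention(q, keys, values, block_size):
    """Attention computed block by block, then reduced with merge()."""
    partials = [
        partial_attention(q, keys[i:i + block_size], values[i:i + block_size])
        for i in range(0, len(keys), block_size)
    ]
    state = partials[0]
    for part in partials[1:]:
        state = merge(state, part)
    return finalize(state)
```

Because `merge` is associative, the partial results could equally be combined in a tree or by whichever worker finishes last; the sequential loop in `blocked_attention` is just the simplest grouping.
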
Problem

Research questions and friction points this paper is trying to address.

Long-text Processing
Transformer Model
Efficiency Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

LeanAttention
Transformer Optimization
Parallel Computing