Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences

📅 2023-10-18

🏛️ arXiv.org

📈 Citations: 9

✨ Influential: 1

🤖 AI Summary

Self-attention in Transformers has O(n²) time and memory complexity in the sequence length n, which limits scalability to long sequences. To address this, we propose a divide-and-conquer attention mechanism inspired by the Fast Multipole Method (FMM) that organizes tokens into a multi-scale hierarchy of groups. This reduces complexity to O(n log n), or to O(n) when the queries are also down-sampled, while preserving a global receptive field. The key contributions are threefold: (i) an adaptation of fast summation methods from n-body physics to Transformer attention; (ii) learnable hierarchical aggregation with multi-level grouping and down-sampling of queries, keys, and values; and (iii) a hierarchy with O(log n) levels of resolution, in which more distant groups are larger. Experiments on autoregressive and bidirectional language modeling with medium-size datasets show that the method outperforms other efficient attention variants in both memory footprint and accuracy, which suggests potential for training and generation over substantially longer sequences.

📝 Abstract

Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length hinders the applicability of Transformer-based models to long sequences. To address this, we present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$ from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ or $\mathcal{O}(n)$, while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into $\mathcal{O}(\log n)$ levels of resolution, where groups at greater distances are increasingly larger in size and the weights to compute group quantities are learned. As such, the interaction between tokens far from each other is considered in lower resolution in an efficient hierarchical manner. The overall complexity of Fast Multipole Attention is $\mathcal{O}(n)$ or $\mathcal{O}(n \log n)$, depending on whether the queries are down-sampled or not. This multi-level divide-and-conquer strategy is inspired by fast summation methods from $n$-body physics and the Fast Multipole Method. We perform evaluation on autoregressive and bidirectional language modeling tasks and compare our Fast Multipole Attention model with other efficient attention variants on medium-size datasets. We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy. The Fast Multipole Attention mechanism has the potential to empower large language models with much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences.
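
The complexity claim in the abstract can be made concrete with a rough count of query-key interactions. The sketch below is illustrative only and is not taken from the paper: it assumes each query attends to `window` nearby tokens at full resolution and to a constant number `groups_per_level` of group summaries at each coarser level, with group size doubling per level. The function names and both parameters are hypothetical.

```python
import math


def full_interactions(n: int) -> int:
    """Query-key pairs evaluated by standard self-attention."""
    return n * n


def hierarchical_interactions(n: int, window: int = 4, groups_per_level: int = 2) -> int:
    """Rough interaction count for a multi-level scheme (assumed layout).

    Assumption: group size doubles at each coarser level, so covering a
    sequence of length n needs about log2(n / window) coarse levels.
    """
    levels = max(0, math.ceil(math.log2(n / window)))
    per_query = window + groups_per_level * levels  # grows like log n
    return n * per_query                            # grows like n log n


n = 1024
print(full_interactions(n))          # 1048576
print(hierarchical_interactions(n))  # 20480
```

Under these assumptions the per-query cost grows with the number of levels rather than with n, which is where the $\mathcal{O}(n \log n)$ total comes from. Down-sampling the queries as well, as the abstract describes, is what brings the total down to $\mathcal{O}(n)$; that case is not modeled here.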

Problem

Research questions and friction points this paper is trying to address.

Reduces the quadratic time and memory complexity of self-attention to O(n log n) or O(n)

Enables transformers to handle long sequences and high-resolution inputs

Preserves full-context interactions through multilevel hierarchical attention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical attention with O(log n) levels

Nearby tokens interact at full resolution

Distant tokens interact at lower resolution through group summaries computed with learned weights
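
The near/far split listed above can be sketched in a simplified two-level form. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: it uses only one coarse level instead of O(log n) levels, it replaces the learned group weights with plain mean pooling, and the function name `two_level_attention` and the `block` parameter are hypothetical.

```python
import numpy as np


def two_level_attention(q, k, v, block=4):
    """Toy bidirectional attention with one fine and one coarse level.

    Each query attends to the keys in its own block at full resolution
    and to one summary per other block. Mean pooling is a stand-in for
    the learned weights described in the abstract (assumption).
    """
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of block"
    nb = n // block

    # Coarse level: one key summary and one value summary per block.
    k_coarse = k.reshape(nb, block, d).mean(axis=1)
    v_coarse = v.reshape(nb, block, d).mean(axis=1)

    out = np.empty_like(v, dtype=float)
    for b in range(nb):
        s = slice(b * block, (b + 1) * block)
        far = np.arange(nb) != b  # every block except the query's own
        keys = np.concatenate([k[s], k_coarse[far]], axis=0)
        vals = np.concatenate([v[s], v_coarse[far]], axis=0)

        # Softmax over the mixed fine and coarse keys.
        scores = q[s] @ keys.T / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[s] = weights @ vals
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(two_level_attention(q, k, v, block=4).shape)  # (16, 8)
```

Each query here scores `block + n / block - 1` keys instead of `n`, so every token still sees the whole sequence but distant context arrives in summarized form. Stacking further levels with growing group sizes is what would take this toy version toward the O(n log n) scaling the paper reports.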
