Fundamental Limitations on Subquadratic Alternatives to Transformers

📅 2024-10-05
🏛️ International Conference on Learning Representations
📈 Citations: 2
Influential: 0
🤖 AI Summary
Whether the quadratic time complexity of Transformer attention can be safely replaced by subquadratic algorithms—such as Mamba or approximate attention—is a fundamental open question. Method: Leveraging fine-grained complexity theory, particularly the Orthogonal Vectors Conjecture, we formally reduce the document similarity task—exact identification of the most similar document pair among *n* documents—to attention computation, and analyze it within standard computational models. Contribution/Results: We provide the first rigorous proof that no subquadratic-time algorithm can solve this task exactly; while Transformers compute it in *O*(*n*²) time, all subquadratic-time models—including existing acceleration and approximation schemes—must fail on worst-case instances. This establishes a theoretical necessity for quadratic attention complexity, yielding the first hardness result grounded in complexity theory that imposes a hard architectural constraint on large language model design.
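For orientation (a sketch for this summary, not code from the paper): the Orthogonal Vectors problem behind the conjecture asks, given *n* Boolean vectors, whether some pair has inner product zero. The function name below is illustrative; brute force is quadratic in *n*, and the conjecture asserts no truly subquadratic algorithm exists.

```python
from itertools import combinations

def has_orthogonal_pair(vectors):
    """Orthogonal Vectors: given n Boolean vectors of dimension d, decide
    whether some pair has dot product 0. This brute force runs in
    O(n^2 * d) time; the Orthogonal Vectors Conjecture asserts that no
    algorithm is truly subquadratic in n (for d = omega(log n))."""
    return any(
        all(a * b == 0 for a, b in zip(u, v))
        for u, v in combinations(vectors, 2)
    )

# (1,0,1) and (0,1,0) are orthogonal, so this instance is a yes-instance:
print(has_orthogonal_pair([(1, 0, 1), (0, 1, 0), (1, 1, 1)]))  # True
# All vectors below share a 1 in the first coordinate, so no pair is orthogonal:
print(has_orthogonal_pair([(1, 0), (1, 1), (1, 0)]))  # False
```

The paper's reduction turns exactly such instances into document similarity instances, which is why hardness transfers.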

📝 Abstract
The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and has become the time bottleneck for Transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative. In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.
Problem

Research questions and friction points this paper is trying to address.

Proving subquadratic-time models fail at document similarity tasks
Showing Transformers' quadratic time is unavoidable for certain tasks
Establishing fundamental limits on faster alternatives to attention mechanisms
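The task these limits attach to can be made concrete (a hedged sketch in Python, not code from the paper; the function name and inner-product similarity are illustrative choices): finding the exact most similar pair among *n* document vectors, which brute force solves in *O*(*n*²·*d*) time and which, by the paper's result, no truly subquadratic algorithm solves exactly under the Orthogonal Vectors Conjecture.

```python
from itertools import combinations

def most_similar_pair(docs):
    """Return the index pair of the most similar documents under
    inner-product similarity. Quadratic in the number of documents;
    the paper's hardness result says this exactness cannot be achieved
    in truly subquadratic time on worst-case instances."""
    return max(
        combinations(range(len(docs)), 2),
        key=lambda ij: sum(a * b for a, b in zip(docs[ij[0]], docs[ij[1]])),
    )

docs = [(1, 0, 1), (1, 1, 1), (0, 1, 0)]
# Pairwise similarities: (0,1) -> 2, (0,2) -> 0, (1,2) -> 1
print(most_similar_pair(docs))  # (0, 1)
```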
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reducing document similarity hardness to the Orthogonal Vectors Conjecture from fine-grained complexity
Proving Transformers solve the exact document similarity task in quadratic time
Showing subquadratic alternatives, including Mamba and heuristic attention, must fail on worst-case instances