Infinite-Width Limit of a Single Attention Layer: Analysis via Tensor Programs

📅 2025-06-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing infinite-width theories for Transformers rely on Gaussian approximations that fail to capture the true asymptotic behavior of attention layers—particularly under standard $1/\sqrt{n}$ scaling and finite head counts. Method: We develop the first rigorous theory characterizing the exact limiting distribution of a single-layer attention mechanism as width tends to infinity, without assuming infinitely many heads or nonstandard scaling. Leveraging the Tensor Programs framework, we integrate random matrix theory with conditional distribution analysis to derive its hierarchical structure: conditional on similarity scores, the output is Gaussian; marginally, it is provably non-Gaussian. Contribution/Results: Our theory yields precise, closed-form predictions that align closely with numerical experiments across finite widths and finite head counts. This work establishes the first mathematically sound and practically applicable foundation for a unified infinite-width theory of deep Transformers.
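For intuition, the following minimal NumPy sketch (not the paper's code) simulates a randomly initialized single-head attention layer at finite width $n$ under the standard $1/\sqrt{n}$ weight scaling and inspects one output coordinate across many weight draws. The architecture details (Gaussian token embeddings, query/key/value projections, softmax over $T$ tokens) and the excess-kurtosis diagnostic are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def attention_output_entry(n=256, T=8, rng=None):
    """One output coordinate of a randomly initialized single-head attention layer."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((T, n))                  # T Gaussian token embeddings of width n (assumed input model)
    W_q = rng.standard_normal((n, n)) / np.sqrt(n)   # 1/sqrt(n)-scaled projection weights
    W_k = rng.standard_normal((n, n)) / np.sqrt(n)
    W_v = rng.standard_normal((n, n)) / np.sqrt(n)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(n)                    # similarity scores with standard 1/sqrt(n) scaling
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                # softmax attention weights
    return (A @ V)[0, 0]                             # one coordinate of the attention output

rng = np.random.default_rng(0)
samples = np.array([attention_output_entry(rng=rng) for _ in range(1000)])
mu, var = samples.mean(), samples.var()
excess_kurtosis = ((samples - mu) ** 4).mean() / var ** 2 - 3
print(f"excess kurtosis of the marginal: {excess_kurtosis:.3f} (0 for a Gaussian)")
```

A nonzero excess kurtosis at moderate width is consistent with the claim that the marginal law of the output does not become Gaussian, because the similarity scores (and hence the attention weights) remain random in the limit.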

📝 Abstract
In modern theoretical analyses of neural networks, the infinite-width limit is often invoked to justify Gaussian approximations of neuron preactivations (e.g., via neural network Gaussian processes or Tensor Programs). However, these Gaussian-based asymptotic theories have so far been unable to capture the behavior of attention layers, except under special regimes such as infinitely many heads or tailored scaling schemes. In this paper, leveraging the Tensor Programs framework, we rigorously identify the infinite-width limit distribution of variables within a single attention layer under realistic architectural dimensionality and the standard $1/\sqrt{n}$-scaling, where $n$ denotes the dimensionality. We derive the exact form of this limit law without resorting to infinite-head approximations or tailored scalings, demonstrating that it departs fundamentally from Gaussianity. The non-Gaussianity of this limiting distribution arises from a hierarchical structure: it is Gaussian conditional on the random similarity scores. Numerical experiments validate our theoretical predictions, confirming the effectiveness of our theory at finite width and its accurate description of finite-head attention. Beyond characterizing a standalone attention layer, our findings lay the groundwork for developing a unified theory of deep Transformer architectures in the infinite-width regime.
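The hierarchical structure described in the abstract can be mimicked by a two-stage sampler: first draw random similarity scores, then draw a Gaussian whose variance depends on the resulting attention weights. The sketch below illustrates that construction under assumed forms (in particular, the conditional variance `a @ a` is an illustrative choice, not the paper's closed-form expression); its point is simply that such a conditionally Gaussian scale mixture is non-Gaussian marginally.

```python
import numpy as np

def hierarchical_sample(T=8, rng=None):
    """Draw from a conditionally Gaussian (hence marginally non-Gaussian) law."""
    rng = np.random.default_rng() if rng is None else rng
    scores = rng.standard_normal(T)             # similarity scores stay random in the limit (assumed distribution)
    a = np.exp(scores - scores.max())
    a /= a.sum()                                 # attention weights = softmax(scores)
    cond_var = float(a @ a)                      # conditional variance given the scores (assumed form)
    return rng.normal(0.0, np.sqrt(cond_var))    # Gaussian conditional on the scores

rng = np.random.default_rng(0)
draws = np.array([hierarchical_sample(rng=rng) for _ in range(20000)])
mu, var = draws.mean(), draws.var()
print("marginal excess kurtosis:", ((draws - mu) ** 4).mean() / var ** 2 - 3)
```

Because the variance itself is random, the marginal distribution is a Gaussian scale mixture with heavier tails than any single Gaussian, which is the qualitative behavior the abstract attributes to the attention layer's limit law.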
Problem

Research questions and friction points this paper is trying to address.

Analyzing infinite-width limit of single attention layer
Deriving non-Gaussian limit law without special scaling
Laying groundwork for infinite-width Transformer theory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tensor Programs analyze infinite-width attention layers
Non-Gaussian limit law without infinite-head approximations
Hierarchical structure explains attention layer behavior