The Effect of Attention Head Count on Transformer Approximation

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing work lacks a rigorous theoretical characterization of how the number of attention heads affects the approximation capacity of Transformers. Method: We introduce the generalized D-retrieval task as an analytical framework and establish, for the first time, tight parameter-complexity lower bounds on Transformer approximation in a nonlinear, practically relevant setting. Contribution/Results: We prove that a shortage of attention heads must be compensated by an exponential growth in model parameters; notably, in the single-head case with embedding dimension of order O(T), the feed-forward network alone suffices for complete memorization of the input. We further uncover a fundamental three-way trade-off among head count, embedding dimension, and sequence length. Experiments on synthetic and real-world tasks validate the theory: sufficient heads markedly improve approximation efficiency, whereas too few heads trigger a sharp increase in parameter requirements.
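The paper's generalized D-retrieval task is not defined on this page, so the following is only a hypothetical sketch of a plain key-value retrieval problem of the kind commonly used to probe attention: given a sequence of (key, value) pairs and a query key, the target is the value paired with that key. The function name and data layout are illustrative assumptions, not the paper's construction.

```python
# Hypothetical sketch (not the paper's definition): a basic key-value
# retrieval task. A sequence of T distinct keys with random values is
# followed by a query key; the target is the queried key's value.
import random

def make_retrieval_example(T: int, vocab: int, rng: random.Random):
    """One sequence of T distinct keys with random values, plus a query."""
    keys = rng.sample(range(vocab), T)                 # distinct keys
    values = [rng.randrange(vocab) for _ in range(T)]
    q = rng.randrange(T)                               # index of queried key
    seq = list(zip(keys, values)) + [(keys[q], None)]  # query appended last
    target = values[q]
    return seq, target

rng = random.Random(0)
seq, target = make_retrieval_example(T=5, vocab=100, rng=rng)
# The target is recoverable by matching the final key against earlier keys,
# which is exactly the kind of lookup attention heads can implement.
assert target == dict(seq[:-1])[seq[-1][0]]
```

Solving such a task requires the model to match the query against every earlier position, which is why it is a natural probe for how head count constrains approximation.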

📝 Abstract
The Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $ε$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale as $Ω(1/ε^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, where approximation is entirely achieved by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.
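To make the head-count versus embedding-dimension trade-off concrete, here is a small sketch (not from the paper) counting the parameters of a standard multi-head self-attention block under the usual convention that the head dimension is `d_model // n_heads`. The function name is an illustrative assumption.

```python
# Hypothetical sketch: parameter count of standard multi-head self-attention.
# Under the usual partitioning d_head = d_model // n_heads, the total count
# is independent of the head count: adding heads trades per-head width for
# number of heads rather than adding parameters.

def mha_param_count(d_model: int, n_heads: int) -> int:
    """Parameters in the Q, K, V, and output projections (biases omitted)."""
    assert d_model % n_heads == 0
    d_head = d_model // n_heads
    # Each head has its own W_Q, W_K, W_V of shape (d_model, d_head).
    qkv = 3 * n_heads * d_model * d_head
    # W_O maps the concatenated heads (n_heads * d_head) back to d_model.
    out = (n_heads * d_head) * d_model
    return qkv + out

print(mha_param_count(64, 1))  # 16384, i.e. 4 * d_model**2
print(mha_param_count(64, 8))  # 16384, same budget split across 8 heads
```

Since the parameter budget is fixed at $4\,d_{\text{model}}^2$ under this convention, any expressive advantage of more heads comes from how the budget is partitioned, which is precisely the kind of trade-off among head count, embedding dimension, and sequence length that the paper formalizes.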
Problem

Research questions and friction points this paper is trying to address.

Analyzes how attention head count affects transformer approximation capabilities
Establishes theoretical bounds on parameter complexity for efficient approximation
Investigates single-head transformer limitations and feed-forward network roles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced the generalized D-retrieval task, proved dense in the space of continuous functions, as an analytical framework
Established the first tight parameter-complexity lower bounds for Transformer approximation in a nonlinear, practically relevant setting
Showed that, in the single-head case, an embedding dimension of order O(T) enables complete memorization of the input via the feed-forward block
Authors

Penghao Yu, Department of Mathematics, National University of Singapore
Zeyu Bao, National University of Singapore (Machine Learning)
Haotian Jiang, Institute for Functional Intelligent Materials, National University of Singapore
Ruoxi Yu, Center for Data Science, Peking University
Qianxiao Li, Assistant Professor, Department of Mathematics and Institute for Functional Intelligent Materials (applied mathematics, machine learning, scientific computing, control theory, materials science)