🤖 AI Summary
Existing work lacks a rigorous theoretical characterization of how the number of attention heads affects the approximation capacity of Transformers. Method: We introduce the generalized $D$-retrieval task as an analytical framework to establish, for the first time, tight parameter-complexity lower bounds for Transformer approximation in a nonlinear, practically relevant setting. Contribution/Results: We prove that compensating for an insufficient number of attention heads requires exponentially many model parameters; remarkably, in the single-head case, the feed-forward network alone suffices for perfect memorization. We further uncover a fundamental three-way trade-off among head count, embedding dimension, and sequence length. Experiments on synthetic and real-world tasks validate the theory: sufficiently many heads markedly improve approximation efficiency, whereas too few heads trigger a sharp increase in parameter requirements.
📝 Abstract
The Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $ε$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $Ω(1/ε^{cT})$ for some constant $c$, where $T$ denotes the sequence length. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, in which case the approximation is carried entirely by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.
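To get a feel for the lower bound stated above, the sketch below evaluates the $1/ε^{cT}$ scaling for a few target accuracies. The constant `c` and sequence length `T` are placeholder values chosen for illustration, not values derived in the paper; the point is only that the required parameter count grows exponentially in $T$ as $ε$ shrinks.

```python
# Illustrative only: the 1/eps^(c*T) parameter lower bound from the abstract,
# evaluated with hypothetical c and T to show the exponential blow-up.

def param_lower_bound(eps: float, c: float, T: int) -> float:
    """Evaluate the 1/eps^(c*T) scaling (placeholder constants)."""
    return eps ** (-c * T)

if __name__ == "__main__":
    c, T = 0.5, 8  # hypothetical constant and sequence length
    for eps in (1e-1, 1e-2, 1e-3):
        # Each 10x gain in accuracy multiplies the bound by 10^(c*T).
        print(f"eps={eps:g}: lower bound ~ {param_lower_bound(eps, c, T):.3e}")
```

With these placeholder values, tightening $ε$ from $10^{-1}$ to $10^{-2}$ multiplies the bound by $10^{cT} = 10^4$, which is the qualitative behavior the few-head regime exhibits.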