🤖 AI Summary
This work studies the exact classification of $N$ arbitrary finite-length sequences in $\mathbb{R}^d$ using hardmax-attention Transformers. The authors propose an alternating FFN–Attention architecture that uses low-rank attention weight matrices and exploits the intrinsic clustering effect of self-attention. Through a constructive theoretical analysis, they establish, for the first time, a rigorous guarantee that $O(N)$ blocks and $O(Nd)$ parameters, with complexity independent of sequence length, suffice to achieve 100% classification accuracy on $N$ such sequences. This matches the best known bounds on both depth and parameter complexity. Moreover, it provides the first rigorous theoretical justification for the strong empirical performance of hardmax-attention Transformers, breaking away from the prevailing soft-attention-centric theoretical frameworks.
📝 Abstract
We prove that hardmax attention transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d \geq 2$. Specifically, given $N$ sequences of arbitrary but finite length in $\mathbb{R}^d$, we construct a transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters that perfectly classifies this dataset. Our construction achieves the best complexity estimate to date, independent of the length of the sequences, by innovatively alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices within the attention mechanism, a common practice in real-life transformer implementations. Consequently, our analysis holds twofold significance: it substantially advances the mathematical theory of transformers and it rigorously justifies their exceptional real-world performance in sequence classification tasks.
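To make the central mechanism concrete, the following is a minimal NumPy sketch of a single hardmax self-attention layer with a rank-1 score matrix, in the spirit of the low-rank construction described above. The specific choices here (ties averaged uniformly, identity value map, residual connection) are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def hardmax(scores):
    """Row-wise hardmax: uniform weight over the argmax entries of each row
    (ties are averaged), replacing the usual softmax."""
    m = scores.max(axis=-1, keepdims=True)
    mask = (scores == m).astype(float)
    return mask / mask.sum(axis=-1, keepdims=True)

def hardmax_attention_layer(X, u, v):
    """One hardmax self-attention layer on a sequence X of shape (n, d).

    The score matrix A = u v^T is rank-1, so this layer uses only O(d)
    attention parameters. Values are the tokens themselves and a residual
    connection is applied -- illustrative choices, not the paper's exact ones."""
    A = np.outer(u, v)        # rank-1 parameter matrix
    scores = X @ A @ X.T      # (n, n) attention scores
    W = hardmax(scores)       # each token attends only to its argmax token(s)
    return X + W @ X          # residual + hardmax-attended values
```

Iterating such layers pulls tokens toward the tokens they attend to, which is the clustering effect the construction exploits.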