Exact Sequence Classification with Hardmax Transformers

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work studies the exact classification of $N$ arbitrary finite-length, $d$-dimensional real-valued sequences using hardmax-attention Transformers. We propose an alternating FFN–Attention architecture that leverages low-rank attention weight matrices and exploits the intrinsic clustering property of hardmax self-attention. Through a constructive theoretical analysis, we establish, for the first time, a rigorous guarantee that $O(N)$ layers and $O(Nd)$ parameters, with complexity independent of sequence length, suffice to achieve 100% classification accuracy on such a dataset. This matches the best-known bounds on both depth and parameter complexity, and it provides the first rigorous theoretical justification for the strong empirical performance of hardmax-attention Transformers, departing from the prevailing soft-attention-centric theoretical frameworks.
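To make the mechanism concrete: hardmax attention replaces the softmax over attention scores with a one-hot argmax, so each token attends to exactly one other token. The following NumPy sketch is an illustrative assumption about the general mechanism, not the paper's specific construction (function and variable names are ours):

```python
import numpy as np

def hardmax_attention(X, W_Q, W_K, W_V):
    """Hardmax self-attention: the softmax is replaced by a one-hot
    argmax, so each query attends to exactly one key."""
    scores = (X @ W_Q) @ (X @ W_K).T   # (n, n) attention score matrix
    winners = scores.argmax(axis=1)    # winning key index per query
    V = X @ W_V                        # (n, d) value vectors
    return V[winners]                  # each output row copies one value row

# Toy usage: n = 4 tokens in d = 2 dimensions, identity projections
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
out = hardmax_attention(X, np.eye(2), np.eye(2), np.eye(2))
```

Because every output row is an exact copy of a single value row, repeated hardmax layers collapse tokens onto a shrinking set of representatives, which is the clustering effect the construction exploits.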

📝 Abstract
We prove that hardmax attention transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d\geq 2$. Specifically, given $N$ sequences with an arbitrary but finite length in $\mathbb{R}^d$, we construct a transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters perfectly classifying this dataset. Our construction achieves the best complexity estimate to date, independent of the length of the sequences, by innovatively alternating feed-forward and self-attention layers and by capitalizing on the clustering effect inherent to the latter. Our novel constructive method also uses low-rank parameter matrices within the attention mechanism, a common practice in real-life transformer implementations. Consequently, our analysis holds twofold significance: it substantially advances the mathematical theory of transformers and it rigorously justifies their exceptional real-world performance in sequence classification tasks.
Problem

Research questions and friction points this paper is trying to address.

Classify labeled sequences in Euclidean space
Optimize transformer complexity for sequence classification
Innovate transformer architecture with low-rank matrices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardmax attention transformers
Alternating feed-forward and self-attention layers
Low-rank parameter matrices
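The low-rank point in the list above can be sketched numerically: an attention weight matrix is parameterized as a product of two thin factors, cutting the parameter count from $d^2$ to $2dr$. The dimensions below are illustrative, not taken from the paper:

```python
import numpy as np

d, r = 64, 4                      # embedding dimension and rank, r << d
rng = np.random.default_rng(1)
A = rng.standard_normal((d, r))   # thin left factor,  d x r
B = rng.standard_normal((r, d))   # thin right factor, r x d
W = A @ B                         # rank-<=r weight matrix, d x d

assert np.linalg.matrix_rank(W) <= r
assert 2 * d * r < d * d          # 512 parameters instead of 4096
```

This factored form mirrors common practice in deployed transformers, where query/key projections are often effectively low-rank.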