🤖 AI Summary
This work investigates how well standard Transformers can approximate Hölder continuous functions and establishes explicit upper and lower bounds on the number of blocks required. For the standard architecture with Softmax attention, ReLU activations, and residual connections, the authors give an explicit construction showing that $\mathcal{O}(\varepsilon^{-d_{0}/\alpha})$ blocks suffice to reach accuracy $\varepsilon$. Using a VC-dimension upper bound, they also prove, for the first time, that at least $\Omega(\varepsilon^{-d_{0}/(4\alpha)})$ blocks are necessary. The two exponents differ by a factor of 4, so the bounds bracket the required number of blocks but do not match exactly. The results also yield a corresponding excess risk rate for regression tasks, giving theoretical guarantees on both the expressive power and the generalization performance of Transformers.
📝 Abstract
We explore the expressive power of Transformers by establishing precise upper and lower bounds on the approximation error for the Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-d_{0}/\alpha})$ blocks can approximate any bounded Hölder function with $d_{0}$-dimensional input and smoothness $\alpha\in(0,1]$ to any accuracy $\varepsilon>0$. For the approximation lower bound, leveraging a VC-dimension upper bound, we are the first to rigorously prove that Transformers require at least $\Omega(\varepsilon^{-d_{0}/(4\alpha)})$ blocks to achieve approximation accuracy $\varepsilon$. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates, demonstrating Transformers' empirical effectiveness in real-world settings.
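To make the two rates concrete, recall that a Hölder function with smoothness $\alpha\in(0,1]$ is usually taken to satisfy $|f(x)-f(y)|\le K\|x-y\|^{\alpha}$ for some constant $K$. The choice of norm and domain is not stated in the abstract and is an assumption here. The sketch below is only a rough illustration: it evaluates the two block-count scalings $\varepsilon^{-d_{0}/(4\alpha)}$ and $\varepsilon^{-d_{0}/\alpha}$ with every hidden constant set to 1, which the asymptotic notation does not justify for actual networks. The function name `block_count_rates` is hypothetical and not taken from the paper.

```python
import math
from typing import Tuple


def block_count_rates(eps: float, d0: int, alpha: float) -> Tuple[float, float]:
    """Return (lower, upper) block-count scalings for accuracy eps.

    Assumption: all constants hidden in the Omega/O notation are set to 1,
    so only the growth in eps is meaningful, not the absolute values.
    """
    if not (eps > 0 and d0 >= 1 and 0 < alpha <= 1):
        raise ValueError("need eps > 0, d0 >= 1 and 0 < alpha <= 1")
    lower = eps ** (-d0 / (4 * alpha))  # necessary: Omega(eps^{-d0/(4 alpha)})
    upper = eps ** (-d0 / alpha)        # sufficient: O(eps^{-d0/alpha})
    return lower, upper


if __name__ == "__main__":
    # Illustrative values only: d0 = 4 input dimensions, Lipschitz case alpha = 1.
    for eps in (0.1, 0.01):
        lower, upper = block_count_rates(eps, d0=4, alpha=1.0)
        print(f"eps={eps}: lower ~ {lower:.3g}, upper ~ {upper:.3g}")
```

With constants ignored, the upper scaling is the fourth power of the lower one, which is the gap between the two exponents stated in the abstract.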