Approximation Error Upper and Lower Bounds for Hölder Class with Transformers

📅 2026-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how well standard Transformers approximate Hölder continuous functions and establishes explicit upper and lower bounds on the approximation error. Combining tools from function approximation theory with a VC-dimension upper bound, the authors prove, in what they describe as the first rigorous result of its kind, that achieving accuracy $\varepsilon$ requires at least $Ω(\varepsilon^{-d_{0}/(4α)})$ Transformer blocks when the architecture uses Softmax attention, ReLU activations, and residual connections. They also give an explicit construction that needs at most $\mathcal{O}(\varepsilon^{-d_{0}/α})$ blocks, so the required number of blocks is determined up to a factor of 4 in the exponent. These bounds further yield a corresponding excess risk rate for regression tasks, giving theoretical guarantees on both the expressive power and the generalization performance of Transformers.
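For readers unfamiliar with the function class, a common textbook definition of a bounded Hölder class is sketched below. The domain $[0,1]^{d_{0}}$, the constant $K$, and the choice of norm are assumptions made here for illustration; the paper's own definition may differ in these details. Only the input dimension $d_{0}$ and the smoothness $α\in(0,1]$ come from the abstract.

$$
\mathcal{H}^{α}(K) = \left\{ f:[0,1]^{d_{0}}\to\mathbb{R} \;:\; \sup_{x}|f(x)|\le K,\;\; |f(x)-f(y)|\le K\,\|x-y\|^{α}\ \text{ for all } x,y\in[0,1]^{d_{0}} \right\}
$$

With $α=1$ this is the class of bounded Lipschitz functions; smaller $α$ allows rougher functions, which is why both bounds grow as $α$ decreases.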
📝 Abstract
We explore the expressive power of Transformers by establishing precise upper and lower bounds on the approximation error for the Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-{d_{0}}/α})$ blocks can approximate any bounded Hölder function with $d_{0}$-dimensional input and smoothness $α\in(0,1]$ to any accuracy $\varepsilon>0$. For the approximation lower bound, leveraging a VC-dimension upper bound, we are the first to rigorously prove that Transformers require at least $Ω(\varepsilon^{-{d_{0}}/({4α})})$ blocks to achieve approximation accuracy $\varepsilon$. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates, demonstrating Transformers' empirical effectiveness in real-world settings.
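To get a feel for how far apart the two bounds are, the following minimal sketch evaluates the scalings $\varepsilon^{-d_{0}/(4α)}$ and $\varepsilon^{-d_{0}/α}$ stated above. The function name `block_count_scalings` is hypothetical, and the constants hidden in $Ω(\cdot)$ and $\mathcal{O}(\cdot)$ are not given in the abstract, so they are simply set to 1 here; the numbers indicate growth rates only, not actual block counts.

```python
from typing import Tuple


def block_count_scalings(eps: float, d0: int, alpha: float) -> Tuple[float, float]:
    """Return (lower, upper) scalings of the number of Transformer blocks.

    lower = eps^(-d0 / (4 * alpha))  (lower-bound rate from the abstract)
    upper = eps^(-d0 / alpha)        (upper-bound rate from the abstract)
    Hidden constants are unknown and assumed to be 1 for illustration.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if d0 < 1:
        raise ValueError("d0 must be a positive integer")
    if not (0 < alpha <= 1):
        raise ValueError("alpha must lie in (0, 1]")
    lower = eps ** (-d0 / (4 * alpha))
    upper = eps ** (-d0 / alpha)
    return lower, upper


if __name__ == "__main__":
    # Example values chosen only for illustration.
    for eps in (0.5, 0.1, 0.01):
        lower, upper = block_count_scalings(eps, d0=4, alpha=1.0)
        print(f"eps={eps}: lower ~ {lower:.3g}, upper ~ {upper:.3g}")
```

Because the exponents differ by a factor of 4, the upper scaling is always the fourth power of the lower one, so the gap between the bounds widens quickly as $\varepsilon$ shrinks or $d_{0}$ grows.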
Problem

Research questions and friction points this paper is trying to address.

Approximation Error
Hölder Class
Transformers
Upper and Lower Bounds
Expressive Power
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer approximation
Hölder class
approximation bounds
VC-dimension
excess risk
Xin He
School of Mathematics and Statistics, Wuhan University, Wuhan, China
Yuling Jiao
University of Wuhan
Deep learning, Scientific and statistical computing, Inverse problem
Xiliang Lu
School of Mathematics and Statistics, Wuhan University, Wuhan, China; Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, China
Jerry Zhijian Yang
School of Mathematics and Statistics, Wuhan University, Wuhan, China; Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan, China