Approximation Error Upper and Lower Bounds for Hölder Class with Transformers

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how well standard Transformers approximate Hölder continuous functions and establishes explicit upper and lower bounds on the approximation error. Combining tools from function approximation theory with a VC-dimension analysis, the authors prove that achieving accuracy $\varepsilon$ requires at least $\Omega(\varepsilon^{-d_{0}/(4\alpha)})$ Transformer blocks when the architecture uses Softmax attention, ReLU activations, and residual connections; they state that this is the first rigorous lower bound of this kind. They also give an explicit construction showing that $\mathcal{O}(\varepsilon^{-d_{0}/\alpha})$ blocks suffice. Together the two bounds bracket the number of blocks needed, although their exponents differ by a factor of 4, so the optimal order is determined only up to that gap. The bounds also yield a corresponding excess risk rate for regression tasks, providing theoretical guarantees on both the expressive power and the generalization performance of Transformers.
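To make the gap between the two rates concrete, the following minimal sketch evaluates both orders for given $\varepsilon$, $d_{0}$, and $\alpha$. It is an illustration only: the function name `block_count_orders` is hypothetical, and the hidden constants in $\Omega(\cdot)$ and $\mathcal{O}(\cdot)$ are set to 1 because the summary does not state them, so the returned numbers indicate scaling rather than actual block counts.

```python
import math


def block_count_orders(eps: float, d0: int, alpha: float) -> tuple[float, float]:
    """Return (lower_order, upper_order) for the number of Transformer blocks.

    lower_order = eps ** (-d0 / (4 * alpha))   (order of the lower bound)
    upper_order = eps ** (-d0 / alpha)         (order of the upper bound)

    Assumption: all hidden constants are taken to be 1, so these values
    show how the bounds scale, not how many blocks a real network needs.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if d0 < 1:
        raise ValueError("d0 must be at least 1")
    if not (0 < alpha <= 1):
        raise ValueError("alpha must lie in (0, 1]")

    lower_order = eps ** (-d0 / (4 * alpha))
    upper_order = eps ** (-d0 / alpha)
    return lower_order, upper_order


if __name__ == "__main__":
    lower, upper = block_count_orders(eps=0.1, d0=4, alpha=1.0)
    # With unit constants the upper order is the fourth power of the lower order.
    print(f"lower order: {lower:.4g}, upper order: {upper:.4g}")
    print("upper equals lower**4:", math.isclose(upper, lower ** 4))
```

For $\varepsilon = 0.1$, $d_{0} = 4$, and $\alpha = 1$, the two orders are roughly $10$ and $10^{4}$, which shows how far apart the bounds can be even in low dimension.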

📝 Abstract

We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for the Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-d_{0}/\alpha})$ blocks can approximate any bounded Hölder function with $d_{0}$-dimensional input and smoothness $\alpha\in(0,1]$ to any accuracy $\varepsilon>0$. For the approximation lower bound, leveraging a VC-dimension upper bound, we are the first to rigorously prove that Transformers require at least $\Omega(\varepsilon^{-d_{0}/(4\alpha)})$ blocks to achieve $\varepsilon$ approximation accuracy. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates, demonstrating Transformers' empirical effectiveness in real-world settings.
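For readers who want the statement in symbols, the block below sketches the standard textbook definition of an $\alpha$-Hölder function and restates the two bounds from the abstract as a sandwich. The domain $[0,1]^{d_{0}}$, the choice of norm, the Hölder constant $L$, the constants $c, C$, and the notation $N(\varepsilon)$ for the smallest number of blocks that suffices for every function in the class are assumptions introduced here for illustration; the paper's own definitions and error norm may differ.

```latex
% Standard Hölder condition with smoothness \alpha \in (0,1] (assumed form):
\[
  |f(x) - f(y)| \;\le\; L \,\lVert x - y \rVert^{\alpha}
  \qquad \text{for all } x, y \in [0,1]^{d_{0}} .
\]

% Bounds from the abstract, written for the (hypothetical) quantity
% N(\varepsilon): the minimal number of Transformer blocks needed to
% approximate every bounded Hölder function to accuracy \varepsilon.
\[
  c \,\varepsilon^{-d_{0}/(4\alpha)}
  \;\le\; N(\varepsilon) \;\le\;
  C \,\varepsilon^{-d_{0}/\alpha} .
\]
```

Read this way, the lower and upper exponents differ by a factor of 4, and both grow linearly in the input dimension $d_{0}$ and inversely in the smoothness $\alpha$.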

Problem

Research questions and friction points this paper is trying to address.

Approximation Error

Hölder Class

Transformers

Upper and Lower Bounds

Expressive Power

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer Approximation

Hölder Class

Approximation Bounds