Approximation Bounds for Transformer Networks with Application to Regression

📅 2025-04-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates the approximation capabilities of Transformer networks for functions in Hölder and Sobolev classes, with applications to nonparametric regression under β-mixing dependent observations. Methodologically, it establishes, for the first time, optimal parameter-complexity upper bounds of order ε⁻ᵈˣⁿ⁄ᵞ in Lᵖ-norms (1 ≤ p ≤ ∞); introduces a novel proof framework based on the Kolmogorov–Arnold representation theorem; and reveals a fundamental connection between the column-wise averaging operation in self-attention layers and function smoothness. Theoretical contributions include: (i) relaxing conventional weight-magnitude constraints, thereby substantially enhancing universality; (ii) deriving explicit convergence rates for nonparametric regression under dependent data; and (iii) achieving approximation accuracy matching that of optimal feedforward or recurrent neural networks—thereby providing a foundational theoretical justification for Transformers in non-i.i.d. statistical learning.

📝 Abstract
We explore the approximation capabilities of Transformer networks for Hölder and Sobolev functions, and apply these results to address nonparametric regression estimation with dependent observations. First, we establish novel upper bounds for standard Transformer networks approximating sequence-to-sequence mappings whose component functions are Hölder continuous with smoothness index $\gamma \in (0,1]$. To achieve an approximation error $\varepsilon$ under the $L^p$-norm for $p \in [1, \infty]$, it suffices to use a fixed-depth Transformer network whose total number of parameters scales as $\varepsilon^{-d_x n / \gamma}$. This result not only extends existing findings to include the case $p = \infty$, but also matches the best known upper bounds on the number of parameters previously obtained for fixed-depth FNNs and RNNs. Similar bounds are also derived for Sobolev functions. Second, we derive explicit convergence rates for the nonparametric regression problem under various $\beta$-mixing data assumptions, which allow the dependence between observations to weaken over time. Our bounds on the sample complexity impose no constraints on weight magnitudes. Lastly, we propose a novel proof strategy to establish approximation bounds, inspired by the Kolmogorov-Arnold representation theorem. We show that if the self-attention layer in a Transformer can perform column averaging, the network can approximate sequence-to-sequence Hölder functions, offering new insights into the interpretability of self-attention mechanisms.
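To make the headline scaling concrete, here is a minimal sketch of the parameter-complexity bound $\varepsilon^{-d_x n / \gamma}$ from the abstract, where $d_x$ is the token dimension, $n$ the sequence length, and $\gamma$ the Hölder smoothness. The helper name and the constant `C` are illustrative assumptions, not quantities specified in the paper:

```python
def transformer_param_bound(eps: float, d_x: int, n: int,
                            gamma: float, C: float = 1.0) -> float:
    """Order-of-magnitude parameter count C * eps**(-(d_x * n) / gamma)
    sufficient for L^p approximation error eps (illustrative constant C)."""
    return C * eps ** (-(d_x * n) / gamma)

# Smoother targets (larger gamma) need fewer parameters at the same accuracy:
print(transformer_param_bound(0.1, d_x=1, n=1, gamma=1.0))  # ~1e1
print(transformer_param_bound(0.1, d_x=1, n=1, gamma=0.5))  # ~1e2
```

Note how the exponent grows with both the token dimension and the sequence length, so the bound degrades quickly for long sequences, which is the usual curse-of-dimensionality behavior for Hölder classes.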
Problem

Research questions and friction points this paper is trying to address.

Study Transformer approximation for Hölder and Sobolev functions
Establish convergence rates for nonparametric regression with dependent data
Prove self-attention can approximate sequence-to-sequence Hölder functions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Establishes upper bounds for Transformer networks
Derives convergence rates for nonparametric regression
Proposes novel proof strategy for approximation bounds
Yuling Jiao
School of Artificial Intelligence and the School of Mathematics and Statistics, Wuhan University, Wuhan, China
Yanming Lai
Department of Mathematics, The Hong Kong University of Science and Technology
Defeng Sun
Department of Applied Mathematics and the Research Center for Intelligent Operations Research, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China
Yang Wang
Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
Bokai Yan
Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China