🤖 AI Summary
This paper investigates the curse of dimensionality in optimizing shallow neural networks under the mean-field framework. It addresses networks with Lipschitz (including locally Lipschitz) activation functions and smooth target functions. Methodologically, it employs 2-Wasserstein gradient-flow modeling, mean-field limit analysis, and a rigorous decomposition of empirical and population risks. The key contribution is the first theoretical characterization, grounded in function smoothness, of how the input dimension $d$ and the regularity of the target function jointly govern convergence rates. Specifically, the paper proves that the population risk may decay no faster than $t^{-4r/(d-2r)}$, or more generally $t^{-(4+2\delta)r/(d-2r)}$ for locally Lipschitz activations, where $r$ is the smoothness order of the target function. This quantitatively reveals a fundamental bottleneck: in high dimensions, added smoothness cannot compensate for the dimensional penalty in the exponent. The analysis establishes a new theoretical benchmark for understanding the intrinsic dimension dependence of neural network optimization.
📝 Abstract
The curse of dimensionality in neural network optimization under the mean-field regime is studied. It is demonstrated that when a shallow neural network with a Lipschitz continuous activation function is trained using either empirical or population risk to approximate a target function that is $r$ times continuously differentiable on $[0,1]^d$, the population risk may not decay at a rate faster than $t^{-\frac{4r}{d-2r}}$, where $t$ is an analog of the total number of optimization iterations. This result highlights the presence of the curse of dimensionality in the optimization computation required to achieve a desired accuracy. Instead of analyzing parameter evolution directly, the training dynamics are examined through the evolution of the parameter distribution under the 2-Wasserstein gradient flow. Furthermore, it is established that the curse of dimensionality persists when a locally Lipschitz continuous activation function is employed, where the Lipschitz constant on $[-x,x]$ is bounded by $O(x^\delta)$ for any $x \in \mathbb{R}$. In this scenario, the population risk is shown to decay at a rate no faster than $t^{-\frac{(4+2\delta)r}{d-2r}}$. To the best of our knowledge, this work is the first to analyze the impact of function smoothness on the curse of dimensionality in neural network optimization theory.
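To make the dimensional bottleneck concrete, the following minimal sketch (not from the paper; the function names and sample values of $r$, $d$, and $\delta$ are illustrative assumptions) evaluates the decay exponents $\frac{4r}{d-2r}$ and $\frac{(4+2\delta)r}{d-2r}$ from the stated bounds, showing how they shrink toward zero as the input dimension $d$ grows for a fixed smoothness order $r$:

```python
def lipschitz_exponent(r: float, d: float) -> float:
    """Decay exponent 4r/(d-2r) for a Lipschitz activation (needs d > 2r)."""
    assert d > 2 * r, "the stated rate requires d > 2r"
    return 4 * r / (d - 2 * r)


def local_lipschitz_exponent(r: float, d: float, delta: float) -> float:
    """Decay exponent (4+2*delta)r/(d-2r) for a locally Lipschitz activation."""
    assert d > 2 * r, "the stated rate requires d > 2r"
    return (4 + 2 * delta) * r / (d - 2 * r)


# Hypothetical smoothness order r = 2: the exponent collapses as d grows,
# so reaching accuracy eps needs roughly eps^{-(d-2r)/(4r)} "iterations" t.
for d in (10, 50, 100, 500):
    print(d, lipschitz_exponent(2.0, d), local_lipschitz_exponent(2.0, d, delta=1.0))
```

For instance, with $r=2$ the Lipschitz-case exponent drops from $8/6 \approx 1.33$ at $d=10$ to $8/96 \approx 0.083$ at $d=100$, so the bound $t^{-4r/(d-2r)}$ becomes nearly flat in high dimensions, which is exactly the claimed obstruction.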