🤖 AI Summary
This work quantifies the convergence of the output distribution of shallow neural networks under gradient descent training to its corresponding Gaussian process (GP) in the infinite-width limit. Specifically, it establishes, for the first time, an explicit upper bound on the quadratic Wasserstein distance between the network's output distribution and the GP at any training time $t \geq 0$, revealing a polynomial decay rate of order $O(m^{-1/2} + d^{1/2}m^{-1/2})$ in the network width $m$, with explicit dependence on the input dimension $d$. Methodologically, the analysis integrates gradient descent dynamics, probability metric theory, and asymptotic expansions in the infinite-width regime, overcoming prior limitations that restricted analysis to initialization or equilibrium. The result provides a precise finite-width error characterization, substantially strengthening the theoretical foundation of neural–Gaussian process equivalence within the neural tangent kernel (NTK) framework.
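For concreteness, the stated rate can be read as a bound of the following schematic form, where $f_t^{(m)}$ denotes the output distribution of the width-$m$ network at training time $t$, $G_t$ the limiting Gaussian process, and $C_t$ a width-independent factor; the notation and the exact constants are assumptions made here for illustration, not quotations from the paper:

$$\mathcal{W}_2\big(f_t^{(m)},\, G_t\big) \;\le\; C_t\left(m^{-1/2} + d^{1/2} m^{-1/2}\right) \;=\; C_t\,\frac{1 + \sqrt{d}}{\sqrt{m}}.$$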
📝 Abstract
In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit.
While previous work has established qualitative convergence under broad settings, precise, finite-width estimates remain limited, particularly during training.
We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time $t \ge 0$, demonstrating polynomial decay with network width.
Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.
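As a purely illustrative numerical sketch (not part of the paper), the snippet below empirically estimates the quadratic Wasserstein distance between the output of a finite-width shallow ReLU network and its limiting Gaussian marginal at a single input point. It only probes the distribution at initialization, whereas the paper's bounds hold at any training time $t \ge 0$; the architecture, parameterization, and sample sizes are assumptions chosen here for the illustration.

```python
import numpy as np

def shallow_relu_net(x, m, rng):
    """Width-m shallow ReLU network (NTK-style scaling) at initialization:
    f(x) = (1/sqrt(m)) * sum_i a_i * relu(w_i . x), a_i ~ N(0,1), w_i ~ N(0, I_d)."""
    d = x.shape[0]
    W = rng.standard_normal((m, d))   # hidden-layer weights
    a = rng.standard_normal(m)        # output weights
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def w2_empirical_1d(samples_p, samples_q):
    """Quadratic Wasserstein distance between two equal-size 1-D empirical
    distributions: pair sorted samples and average the squared differences."""
    p, q = np.sort(samples_p), np.sort(samples_q)
    return np.sqrt(np.mean((p - q) ** 2))

rng = np.random.default_rng(0)
d, n_samples = 10, 5_000
x = rng.standard_normal(d)

# Limiting GP marginal at a single input x for this architecture:
# f(x) ~ N(0, K(x, x)) with K(x, x) = E[relu(w.x)^2] = ||x||^2 / 2 for w ~ N(0, I_d).
gp_std = np.linalg.norm(x) / np.sqrt(2.0)
gp_samples = gp_std * rng.standard_normal(n_samples)

# The empirical estimate has a sampling-noise floor of roughly gp_std / sqrt(n_samples),
# so the decay with m is visible only until that floor is reached.
for m in (10, 100, 1_000):
    net_samples = np.array([shallow_relu_net(x, m, rng) for _ in range(n_samples)])
    print(f"m = {m:5d}   W2(net, GP) ~ {w2_empirical_1d(net_samples, gp_samples):.4f}")
```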