🤖 AI Summary
This work investigates the dynamical mechanisms of online stochastic gradient descent (SGD) training for two-layer neural networks on Gaussian input data, focusing on the challenging regime of wide networks (P ≫ 1) with power-law decay in the second-layer coefficients, which leads to a divergent condition number. We develop a multiscale analytical framework that integrates Hermite polynomial expansions, random matrix theory, and nonlinear dynamical systems analysis. This framework yields a precise characterization of the sharp transition times at which individual signal directions are learned, and proves that the superposition of these P ≫ 1 emergent learning curves induces a smooth scaling law for the overall mean squared error (MSE). We derive exact exponents quantifying the dependence of the MSE on sample size *n*, SGD iteration count *t*, and parameter count *P*: MSE ∝ *n*⁻ᵃ *t*⁻ᵇ *P*⁻ᶜ. Crucially, the learning transition threshold is shown to depend explicitly on the information exponent k∗ > 2 and on the parity of the activation function. Theory and experiments are in strong agreement, revealing an intrinsic connection between emergence and scaling behavior in large-scale networks.
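To make the role of the Hermite expansion concrete, the sketch below numerically recovers the information exponent of an activation as the lowest nonzero degree in its Hermite expansion. This is an illustration rather than code from the paper; the choice of activation (a normalized fourth Hermite polynomial, so k∗ = 4), the quadrature resolution, and the numerical tolerance are our own assumptions.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as H

def hermite_coeffs(sigma, max_deg=10, quad_pts=200):
    """Probabilists' Hermite coefficients c_k = E[sigma(z) He_k(z)] / k! for z ~ N(0, 1)."""
    z, w = H.hermegauss(quad_pts)          # Gauss-Hermite_e nodes/weights (weight e^{-z^2/2})
    w = w / np.sqrt(2.0 * np.pi)           # rescale weights to average against the standard Gaussian
    return np.array([
        np.sum(w * sigma(z) * H.hermeval(z, [0.0] * k + [1.0])) / math.factorial(k)
        for k in range(max_deg + 1)
    ])

# Illustrative even activation: normalized He_4(z) = (z^4 - 6z^2 + 3)/sqrt(4!), so k_* = 4.
sigma = lambda z: (z**4 - 6.0 * z**2 + 3.0) / np.sqrt(24.0)

c = hermite_coeffs(sigma)
k_star = next(k for k in range(1, len(c)) if abs(c[k]) > 1e-8)   # information exponent
print(k_star, np.round(c, 3))              # k_star = 4; coefficients below degree 4 vanish
```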
📝 Abstract
We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p \cdot \sigma(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where the activation $\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent $k_*>2$ (defined as the lowest degree in the Hermite expansion), $\{\boldsymbol{v}^*_p\}_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal directions, and the non-negative second-layer coefficients satisfy $\sum_{p} a_p^2=1$. We focus on the challenging ``extensive-width'' regime $P\gg 1$ and permit a diverging condition number in the second layer, covering as a special case the power-law scaling $a_p \asymp p^{-\beta}$ where $\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and explicitly identify sharp transition times to recover each signal direction. In the power-law setting, we characterize scaling law exponents for the MSE loss with respect to the number of training samples and SGD steps, as well as the number of parameters in the student neural network. Our analysis entails that while the learning of individual teacher neurons exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning curves at different timescales leads to a smooth scaling law in the cumulative objective.
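As a concrete illustration of this setup, the following minimal simulation sketches the teacher model and the online SGD update on fresh Gaussian samples; it is not the paper's code. The student architecture (width M with a frozen, uniform second layer), the spherical renormalization of student neurons, the step size, and all problem sizes are illustrative assumptions, and whether or when each overlap grows is precisely the transition-time question the analysis addresses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (assumptions, chosen small for a quick run).
d, P, M, beta = 64, 8, 32, 1.0

# Even activation with information exponent k_* = 4 (normalized He_4), and its derivative.
sigma  = lambda z: (z**4 - 6.0 * z**2 + 3.0) / np.sqrt(24.0)
dsigma = lambda z: (4.0 * z**3 - 12.0 * z) / np.sqrt(24.0)

# Teacher: orthonormal signal directions v_p^* and power-law coefficients with sum_p a_p^2 = 1.
V = np.linalg.qr(rng.standard_normal((d, P)))[0].T     # rows are orthonormal v_p^*
a = np.arange(1, P + 1, dtype=float) ** (-beta)
a /= np.linalg.norm(a)
teacher = lambda x: a @ sigma(V @ x)

# Student: two-layer network; only the first layer is trained (frozen second layer is an assumption).
W = rng.standard_normal((M, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
b = np.full(M, 1.0 / np.sqrt(M))

eta, T = 0.01, 100_000                                 # step size / number of SGD steps (illustrative)
for _ in range(T):
    x = rng.standard_normal(d)                         # fresh Gaussian sample each step (online SGD)
    pre = W @ x
    err = b @ sigma(pre) - teacher(x)                  # residual of the MSE objective
    W -= eta * err * (b * dsigma(pre))[:, None] * x[None, :]   # one-sample gradient step
    W /= np.linalg.norm(W, axis=1, keepdims=True)      # keep student neurons on the sphere (assumption)

# Per-direction recovery: best alignment of any student neuron with each v_p^*.
print(np.round(np.abs(W @ V.T).max(axis=0), 2))
```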