🤖 AI Summary
This paper investigates the generalization gap of overparameterized models trained via stochastic Langevin dynamics at temperature β⁻¹. Addressing limitations of conventional generalization bounds—which depend on mixing time, dimensionality, gradient norms, or model architecture—we derive the first bound independent of all such quantities: with probability at least 1−δ, the generalization error is bounded by √((β 𝔼L(θ₀) + log(1/δ))/N), where 𝔼L(θ₀) = O(1) under standard initialization. Methodologically, we (1) establish a generalized second law of thermodynamics to quantify the bounded deviation of the parameter distribution from its initial state, and (2) leverage the Gibbs stationary distribution together with marginal divergence control from information theory to enable uniform analysis across arbitrary training times. Our results demonstrate that the generalization gap is governed solely by the initial loss and the temperature, revealing temperature’s intrinsic regularizing role in stochastic optimization.
📝 Abstract
We analyze the generalization gap (the gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $\theta_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $\beta^{-1}$, i.e. gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $\beta^{-1}$-variance Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, at any time during training, by $\sqrt{(\beta\mathbb{E} L(\theta_0) + \log(1/\delta))/N}$ with probability $1-\delta$ over the dataset, where $N$ is the sample size and $\mathbb{E} L(\theta_0) = O(1)$ with standard initialization scaling. In contrast to previous guarantees, we have no dependence on either training time or reliance on mixing, nor a dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov-process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the divergence of the marginal distribution from initialization remains bounded, as implied by a generalized second law of thermodynamics.
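The dynamics described above — a gradient step on the training loss plus Gaussian noise whose variance is set by the temperature $\beta^{-1}$ — can be sketched via a standard Euler–Maruyama discretization. This is only an illustrative toy (the quadratic loss, step size `eta`, and function names are our own, not from the paper):

```python
import numpy as np

def grad_loss(theta, X, y):
    # Gradient of the mean squared error 0.5/N * ||X @ theta - y||^2.
    return X.T @ (X @ theta - y) / len(y)

def langevin_train(X, y, beta=1e4, eta=1e-2, steps=2000, seed=0):
    """Discretized Langevin dynamics (a sketch, not the paper's exact setup).

    Each step is a gradient-descent update perturbed by Gaussian noise with
    per-step variance 2 * eta / beta -- the discrete analogue of running the
    continuous-time dynamics at temperature beta^{-1}.
    """
    rng = np.random.default_rng(seed)
    # Standard initialization: theta_0 ~ p_0 (here a standard Gaussian).
    theta = rng.standard_normal(X.shape[1])
    for _ in range(steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - eta * grad_loss(theta, X, y) + np.sqrt(2 * eta / beta) * noise
    return theta
```

At large `beta` (low temperature) the noise term vanishes and the update reduces to plain gradient descent; the paper's bound scales as $\sqrt{\beta}$, which is the regularizing role of temperature the abstract refers to.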