🤖 AI Summary
This paper investigates the generalization gap of overparameterized models trained via stochastic Langevin dynamics at temperature β⁻¹. Addressing limitations of conventional generalization bounds—which depend on mixing time, dimensionality, gradient norms, or model architecture—we derive the first bound independent of all such quantities: with probability at least 1−δ, the generalization error is bounded by √((β 𝔼L(θ₀) + log(1/δ))/N), where 𝔼L(θ₀) = O(1) under standard initialization. Methodologically, we (1) establish a generalized second law of thermodynamics to quantify the bounded deviation of the parameter distribution from its initial state, and (2) leverage the Gibbs stationary distribution together with marginal divergence control from information theory to enable uniform analysis across arbitrary training times. Our results demonstrate that the generalization gap is governed solely by the initial loss and the temperature, revealing temperature’s intrinsic regularizing role in stochastic optimization.
📝 Abstract
We analyze the generalization gap (the gap between the training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $\theta_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $\beta^{-1}$, i.e. gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $\beta^{-1}$-variance Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, at any time during training, by $\sqrt{(\beta\mathbb{E} L(\theta_0) + \log(1/\delta))/N}$ with probability $1-\delta$ over the dataset, where $N$ is the sample size and $\mathbb{E} L(\theta_0) = O(1)$ with standard initialization scaling. In contrast to previous guarantees, we have no dependence on either training time or reliance on mixing, nor a dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov-process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple, once we observe that the divergence of the marginal distribution from initialization remains bounded, as implied by a generalized second law of thermodynamics.
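The dynamics described above — a gradient step on the training loss plus Gaussian noise whose variance is set by the temperature $\beta^{-1}$ — can be sketched via a standard Euler–Maruyama discretization. This is only an illustrative toy (the quadratic loss, step size `eta`, and function names are our own, not from the paper):

```python
import numpy as np

def grad_loss(theta, X, y):
    # Gradient of the mean squared error 0.5/N * ||X @ theta - y||^2.
    return X.T @ (X @ theta - y) / len(y)

def langevin_train(X, y, beta=1e4, eta=1e-2, steps=2000, seed=0):
    """Discretized Langevin dynamics (a sketch, not the paper's exact setup).

    Each step is a gradient-descent update perturbed by Gaussian noise with
    per-step variance 2 * eta / beta -- the discrete analogue of running the
    continuous-time dynamics at temperature beta^{-1}.
    """
    rng = np.random.default_rng(seed)
    # Standard initialization: theta_0 ~ p_0 (here a standard Gaussian).
    theta = rng.standard_normal(X.shape[1])
    for _ in range(steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - eta * grad_loss(theta, X, y) + np.sqrt(2 * eta / beta) * noise
    return theta
```

At large `beta` (low temperature) the noise term vanishes and the update reduces to plain gradient descent; the paper's bound scales as $\sqrt{\beta}$, which is the regularizing role of temperature the abstract refers to.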