A Bootstrap Perspective on Stochastic Gradient Descent

📅 2025-12-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates the generalization mechanism underlying stochastic gradient descent's (SGD) advantage over deterministic gradient descent (GD), attributing it to the gradient variability induced by mini-batch sampling, which implicitly models the randomness of the data-generating process in the spirit of the statistical bootstrap. Method: The authors formalize algorithmic variability as the trace of the gradient covariance matrix and prove that SGD implicitly regularizes this trace, thereby favoring parameters that are robust to sampling noise and yield stable solutions. They combine empirical risk minimization (ERM) theory, idealized experiments, and numerical validation on neural networks; the stated novelty is the first explicit use of an algorithmic-variability estimate as a regularizer during training. Contribution/Results: Experiments show that this explicit regularization significantly improves test performance. The work offers a bootstrap-based interpretation of SGD's generalization advantage and a principled, actionable path to verifying it empirically.
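As a hedged formalization of the quantity described in the summary, the explicitly regularized objective could be written as follows; the notation (empirical risk \widehat{R}_n, per-example gradients g_i, weight \lambda) is illustrative and not taken verbatim from the paper:

\[
\min_{\theta}\ \widehat{R}_n(\theta) + \lambda\,\operatorname{tr}\big(\widehat{\Sigma}(\theta)\big),
\qquad
\widehat{\Sigma}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\big(g_i(\theta)-\bar g(\theta)\big)\big(g_i(\theta)-\bar g(\theta)\big)^{\top},
\qquad
\bar g(\theta)=\frac{1}{n}\sum_{i=1}^{n} g_i(\theta),
\]

so that \operatorname{tr}\big(\widehat{\Sigma}(\theta)\big) is the mean squared deviation of the per-example gradients from their average, a scalar measure of the gradient variability that SGD is argued to control implicitly.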

📝 Abstract
Machine learning models trained with stochastic gradient descent (SGD) can generalize better than those trained with deterministic gradient descent (GD). In this work, we study SGD's impact on generalization through the lens of the statistical bootstrap: SGD uses gradient variability under batch sampling as a proxy for solution variability under the randomness of the data collection process. We use empirical results and theoretical analysis to substantiate this claim. In idealized experiments on empirical risk minimization, we show that SGD is drawn to parameter choices that are robust under resampling and thus avoids spurious solutions even if they lie in wider and deeper minima of the training loss. We prove rigorously that by implicitly regularizing the trace of the gradient covariance matrix, SGD controls the algorithmic variability. This regularization leads to solutions that are less sensitive to sampling noise, thereby improving generalization. Numerical experiments on neural network training show that explicitly incorporating the estimate of the algorithmic variability as a regularizer improves test performance. This fact supports our claim that bootstrap estimation underpins SGD's generalization advantages.
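To make the abstract's explicit-regularization experiment concrete, here is a minimal sketch in JAX for a toy least-squares problem, assuming a linear model, synthetic data, and a hand-picked weight lam; none of the names or values below come from the paper's actual implementation:

import jax
import jax.numpy as jnp

def per_example_loss(w, x, y):
    # squared error of a linear model on one example
    return 0.5 * (jnp.dot(x, w) - y) ** 2

def grad_cov_trace(w, X, Y):
    # per-example gradients stacked into an (n, d) array
    grads = jax.vmap(jax.grad(per_example_loss), in_axes=(None, 0, 0))(w, X, Y)
    centered = grads - grads.mean(axis=0)
    # trace of the empirical gradient covariance = mean squared deviation of the gradients
    return (centered ** 2).sum(axis=1).mean()

def regularized_loss(w, X, Y, lam=0.1):
    data_loss = jax.vmap(per_example_loss, in_axes=(None, 0, 0))(w, X, Y).mean()
    return data_loss + lam * grad_cov_trace(w, X, Y)

# toy data: linear ground truth plus noise (purely illustrative)
key_x, key_noise = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(key_x, (256, 5))
w_true = jnp.arange(1.0, 6.0)
Y = X @ w_true + 0.1 * jax.random.normal(key_noise, (256,))

# a few full-batch gradient steps on the explicitly regularized objective
w = jnp.zeros(5)
grad_step = jax.jit(jax.grad(regularized_loss))
for _ in range(500):
    w = w - 0.05 * grad_step(w, X, Y)
print(w)  # should approach w_true; the penalty discourages high gradient variability

The only change relative to plain ERM training is the added grad_cov_trace penalty, which estimates the same gradient variability that SGD is said to regularize implicitly.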
Problem

Research questions and friction points this paper is trying to address.

Why does SGD generalize better than deterministic gradient descent?
Does SGD implicitly regularize the gradient covariance, and does this control algorithmic variability?
Does SGD avoid spurious solutions by favoring parameters that are robust under data resampling?
Innovation

Methods, ideas, or system contributions that make the work stand out.

SGD uses gradient variability under mini-batch sampling as a proxy for randomness in the data collection process
SGD implicitly regularizes gradient covariance trace to control variability
Explicit algorithmic variability regularization improves neural network test performance
Hongjian Lan
School of Computational Science and Engineering, Georgia Institute of Technology
Yucong Liu
School of Mathematics, Georgia Institute of Technology
Florian Schäfer
Humboldt-Universität zu Berlin