Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work studies the convergence of the last iterate of stochastic gradient descent (SGD) for smooth convex optimization under interpolation, where the gradient-noise variance σ⋆² at the optimum is zero or negligible, a setting common in overparameterized model training and in solving consistent linear systems. The analysis covers SGD with a constant stepsize η ≤ 1/β on β-smooth convex losses, and extends near-optimal last-iterate convergence rates beyond least squares to general smooth convex functions. The expected excess risk after T iterations is shown to be Õ(1/(ηT^{1−βη/2}) + ηT^{βη/2}σ⋆²). When σ⋆² = 0, taking η = 1/β yields O(1/√T), improving upon the prior O(T^{−1/4}) bound for realizable linear regression; in the general case, a well-tuned stepsize achieves the near-optimal rate Õ(1/T + σ⋆/√T). The key innovation is a noise-dependent, fine-grained convergence bound that establishes, for the first time under interpolation, optimal-order convergence of the last iterate.
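As a concrete illustration of the setting (not code from the paper), the sketch below runs constant-stepsize SGD with η = 1/β on a toy realizable least-squares problem, where interpolation holds and the gradient noise at the optimum is zero. All names and the problem sizes are hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy realizable least-squares problem: y_i = <x_i, w_star> exactly,
# so the gradient noise at the optimum is zero (sigma_star^2 = 0).
n, d = 200, 20
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star

# The per-sample losses f_i(w) = 0.5 * (<x_i, w> - y_i)^2 are beta-smooth
# with beta = max_i ||x_i||^2; the analyzed regime uses a constant eta <= 1/beta.
beta = np.max(np.sum(X**2, axis=1))
eta = 1.0 / beta

def excess_risk(w):
    r = X @ w - y
    return 0.5 * np.mean(r**2)  # optimal value is 0 under interpolation

w = np.zeros(d)
T = 5000
for t in range(T):
    i = rng.integers(n)                # sample one data point uniformly
    grad = (X[i] @ w - y[i]) * X[i]    # stochastic gradient of f_i at w
    w = w - eta * grad                 # constant-stepsize SGD step

print(excess_risk(w))  # last-iterate excess risk
```

In this zero-noise case the paper's bound specializes to an O(1/√T) excess-risk rate for the last iterate at η = 1/β; on this toy instance the iterates in fact contract linearly, so the final excess risk is tiny.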

📝 Abstract
We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at optimum is zero or near zero. The behavior of the last iterate of SGD in this setting -- particularly with large (constant) stepsizes -- has received growing attention in recent years due to implications for the training of over-parameterized models, as well as to analyzing forgetting in continual learning and to understanding the convergence of the randomized Kaczmarz method for solving linear systems. We establish that after $T$ steps of SGD on $β$-smooth convex loss functions with stepsize $η \leq 1/β$, the last iterate exhibits expected excess risk $\widetilde{O}(1/(ηT^{1-βη/2}) + ηT^{βη/2} σ_\star^2)$, where $σ_\star^2$ denotes the variance of the stochastic gradients at the optimum. In particular, for a well-tuned stepsize we obtain a near optimal $\widetilde{O}(1/T + σ_\star/\sqrt{T})$ rate for the last iterate, extending the results of Varre et al. (2021) beyond least squares regression; and when $σ_\star=0$ we obtain a rate of $O(1/\sqrt{T})$ with $η=1/β$, improving upon the best-known $O(T^{-1/4})$ rate recently established by Evron et al. (2025) in the special case of realizable linear regression.
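The abstract's connection to the randomized Kaczmarz method can be made explicit: Kaczmarz on a consistent system $Ax=b$ is exactly SGD on the per-row squared losses with stepsize $1/\|a_i\|^2$, i.e. the $η = 1/β$ endpoint of the constant-stepsize regime. A minimal sketch, with all names and sizes being illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Consistent linear system Ax = b (interpolation: an exact solution exists).
m, d = 100, 10
A = rng.standard_normal((m, d))
x_true = rng.standard_normal(d)
b = A @ x_true

# Randomized Kaczmarz: project the iterate onto a random row's hyperplane.
# This is SGD on f_i(x) = 0.5 * (<a_i, x> - b_i)^2 with stepsize 1/||a_i||^2.
x = np.zeros(d)
for _ in range(3000):
    i = rng.integers(m)
    a = A[i]
    x = x + (b[i] - a @ x) / (a @ a) * a  # orthogonal projection step

print(np.linalg.norm(x - x_true))  # distance of the last iterate to the solution
```

Each step is a non-expansive projection of the error, so the last iterate converges to the interpolating solution; the paper's analysis covers this regime as the special case $σ_\star = 0$.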
Problem

Research questions and friction points this paper is trying to address.

Analyzes last-iterate convergence of SGD in smooth interpolation regime
Studies excess risk bounds for convex loss functions with large stepsizes
Extends convergence results beyond least squares regression
Innovation

Methods, ideas, or system contributions that make the work stand out.

SGD with large constant stepsizes
Optimized stepsize for near-optimal rate
Improved convergence rate for zero noise
Amit Attia
Blavatnik School of Computer Science, Tel Aviv University
Matan Schliserman
Blavatnik School of Computer Science, Tel Aviv University
Uri Sherman
Blavatnik School of Computer Science, Tel Aviv University
Tomer Koren
Associate Professor at Tel Aviv University
Machine Learning · Optimization · Reinforcement Learning