On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

📅 2026-01-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how momentum degrades tracking performance in stochastic gradient descent (SGD) under nonstationary stochastic optimization, where both the data distribution and the optimal parameters drift over time. Building on an error decomposition into initialization, noise, and drift components, and combining dynamic regret analysis, gradient-variation constraints, and strong-convexity and smoothness assumptions, the study gives the first theoretical evidence that momentum introduces an irreducible "inertial window" in drift-dominated regimes, constituting an information-theoretic performance bottleneck. The authors derive a minimax lower bound on dynamic regret showing that momentum amplifies the tracking error induced by parameter drift, and that this error diverges as the momentum parameter approaches one. These results precisely delineate the conditions under which standard SGD provably outperforms its momentum variants.
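The three-part decomposition described above can be sketched schematically as follows. This is an illustrative shape consistent with standard tracking analyses, not the paper's exact bound: constants, cross terms, and exact exponents are omitted, and the drift factor $(1-\beta)^{-2}$ is a stand-in for the divergence as $\beta \to 1$.

$$
\mathbb{E}\,\bigl\|x_T - x_T^\star\bigr\|^2 \;\lesssim\;
\underbrace{(1 - \eta\mu)^{T}\,\bigl\|x_0 - x_0^\star\bigr\|^2}_{\text{initialization (transient)}}
\;+\;
\underbrace{\frac{\eta\,\sigma^2}{\mu}}_{\text{noise-induced}}
\;+\;
\underbrace{\frac{1}{(1-\beta)^{2}}\left(\frac{\Delta}{\eta\mu}\right)^{2}}_{\text{drift-induced}}
$$

Here $\eta$ is the step size, $\mu$ the strong-convexity constant, $\sigma^2$ the gradient-noise variance, $\beta$ the momentum parameter, and $\Delta = \sup_t \|x_t^\star - x_{t-1}^\star\|$ a per-step drift bound. The transient term vanishes geometrically, the noise term shrinks with $\eta$, while the drift term grows as $\eta$ shrinks or $\beta \to 1$, which is the trade-off the summary describes.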

📝 Abstract
In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $\beta$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.
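The drift-amplification effect claimed in the abstract can be illustrated with a minimal simulation: heavy-ball SGD and vanilla SGD tracking the minimizer of a one-dimensional quadratic whose optimum drifts sinusoidally. This is a sketch under assumed settings, not the paper's experiment; the drift frequency `omega` is deliberately chosen inside the heavy-ball dynamics' underdamped (resonant) band, where the inertial lag is most visible, and gradients are noiseless so the measured error is purely drift-induced.

```python
import math

def tracking_mse(beta, eta=0.1, omega=0.3, T=3000, burn_in=1000):
    """Mean squared tracking error of heavy-ball SGD following the
    drifting optimum theta_t = sin(omega * t) of f_t(x) = 0.5*(x - theta_t)^2.
    beta = 0 recovers vanilla SGD."""
    x_prev = x = 0.0
    errs = []
    for t in range(T):
        theta = math.sin(omega * t)        # current optimum
        g = x - theta                      # exact gradient of f_t at x
        # heavy-ball update: gradient step plus inertia term
        x, x_prev = x - eta * g + beta * (x - x_prev), x
        if t >= burn_in:                   # measure steady-state tracking only
            errs.append((x - math.sin(omega * (t + 1))) ** 2)
    return sum(errs) / len(errs)

mse_sgd = tracking_mse(beta=0.0)   # vanilla SGD
mse_hb  = tracking_mse(beta=0.9)   # Polyak heavy-ball
```

With these (assumed) parameters the heavy-ball tracking error comes out several times larger than vanilla SGD's, consistent with the abstract's drift-dominated regime; at very slow drift or with heavy gradient noise the comparison can flip, which is exactly the regime boundary the paper sets out to characterize.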
Problem

Research questions and friction points this paper is trying to address.

nonstationary optimization
momentum SGD
tracking error
dynamic regret
stochastic optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

momentum SGD
nonstationary optimization
tracking error
dynamic regret
inertia penalty