🤖 AI Summary
We study nonsmooth convex optimization under heavy-tailed noise, where stochastic gradients possess only finite $\mathfrak{p}$-th moments for $\mathfrak{p}\in(1,2]$. To address this setting, we refine the theoretical analysis of Clipped Stochastic Gradient Descent (Clipped SGD). Our analysis introduces a generalized notion of effective dimension and combines an optimized application of Freedman's inequality with a sharper characterization of the clipping bias. As a result, we establish the first high-probability convergence rate that improves upon prior bounds. Moreover, in expectation, our upper bound breaks the previously known lower bound for this setting and is provably tight: it exactly matches the new information-theoretic lower bound we derive. This constitutes the first analysis framework for heavy-tailed stochastic optimization that achieves optimal expected convergence guarantees, thereby providing foundational theoretical support for robust stochastic optimization.
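For concreteness, here is a minimal LaTeX rendering of the heavy-tailed noise assumption described above; the symbols $f$, $g$, and $x$ are illustrative placeholders for the objective, the stochastic gradient oracle, and the query point (with $\nabla f(x)$ read as a subgradient in the nonsmooth case), and the paper's exact notation may differ:

```latex
% Heavy-tailed noise: the stochastic gradient g(x) is unbiased and has only
% a bounded \mathfrak{p}-th central moment, \mathfrak{p} \in (1,2], instead
% of a finite variance.
\mathbb{E}\left[g(x)\right] = \nabla f(x),
\qquad
\mathbb{E}\left[\left\| g(x) - \nabla f(x) \right\|^{\mathfrak{p}}\right]
\le \sigma_{\mathfrak{l}}^{\mathfrak{p}}.
```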
📝 Abstract
Optimization under heavy-tailed noise has recently attracted growing attention, since it better matches many modern machine learning tasks, as supported by empirical observations. Concretely, instead of a finite second moment on the gradient noise, a bounded $\mathfrak{p}$-th moment with $\mathfrak{p}\in(1,2]$ has been recognized as more realistic (say, upper bounded by $\sigma_{\mathfrak{l}}^{\mathfrak{p}}$ for some $\sigma_{\mathfrak{l}}\ge0$). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate $\mathcal{O}(\sigma_{\mathfrak{l}}\ln(1/\delta)T^{1/\mathfrak{p}-1})$ (resp. $\mathcal{O}(\sigma_{\mathfrak{l}}^2\ln^2(1/\delta)T^{2/\mathfrak{p}-2})$) for nonsmooth convex (resp. strongly convex) problems, where $\delta\in(0,1]$ is the failure probability and $T\in\mathbb{N}$ is the time horizon. In this work, we provide a refined analysis for Clipped SGD and obtain two rates, $\mathcal{O}(\sigma_{\mathfrak{l}}d_{\mathrm{eff}}^{-1/(2\mathfrak{p})}\ln^{1-1/\mathfrak{p}}(1/\delta)T^{1/\mathfrak{p}-1})$ and $\mathcal{O}(\sigma_{\mathfrak{l}}^2 d_{\mathrm{eff}}^{-1/\mathfrak{p}}\ln^{2-2/\mathfrak{p}}(1/\delta)T^{2/\mathfrak{p}-2})$, that are faster than the aforementioned best results, where $d_{\mathrm{eff}}\ge1$ is a quantity we call the *generalized effective dimension*. Our analysis improves upon the existing approach in two respects: a better utilization of Freedman's inequality and finer bounds on the clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds. Lastly, to complement the study, we establish new lower bounds for both high-probability and in-expectation convergence. Notably, the in-expectation lower bounds match our new upper bounds, indicating the optimality of our refined analysis for convergence in expectation.
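To make the clipping operation concrete, below is a minimal Python sketch of Clipped SGD. The step size, the clipping threshold `clip_level`, the oracle name `stochastic_grad`, and the final iterate averaging are illustrative assumptions, not the paper's prescribed choices:

```python
import numpy as np

def clip(g, lam):
    """Clip the vector g to norm at most lam: min(1, lam / ||g||) * g."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

def clipped_sgd(stochastic_grad, x0, T, step_size, clip_level):
    """Run T steps of Clipped SGD from x0.

    stochastic_grad(x) returns a (possibly heavy-tailed) stochastic
    (sub)gradient of the objective at x; clip_level is the threshold lambda.
    """
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(T):
        g = stochastic_grad(x)
        # The clipped update bounds the influence of any single heavy-tailed sample.
        x = x - step_size * clip(g, clip_level)
        iterates.append(x.copy())
    # Averaging the iterates is a common output choice for convex problems.
    return np.mean(iterates, axis=0)
```

The key point is that clipping caps the per-step contribution of the noise at `clip_level`, which is what enables high-probability guarantees even when the noise has no finite variance; the cost is the clipping bias whose finer characterization is one of the two improvements in the refined analysis above.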