Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies the convergence of first-order optimization methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. It considers objectives whose curvature is an affine function of the objective value and, together with gradient domination conditions, establishes a general convergence rate for steepest descent and for deterministic, diagonal variants of RMSProp and Adam. For logistic regression on separable data and for the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD. For a class of two-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step size and momentum parameter. A lower bound further shows that, under the same assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy-ball momentum.
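
To make the phrase "curvature as an affine function of the objective value" concrete, the block below writes out one plausible form of such a condition and checks it on the scalar logistic loss. The constants L_0 and L_1, the choice of norm, and the inequality form are assumptions made here for illustration; the paper's exact statement may differ. The scalar check itself is a standard calculus fact.

```latex
% Assumed form of the condition (notation L_0, L_1 is illustrative, not taken from the source):
\[
  \|\nabla^2 f(w)\| \;\le\; L_0 + L_1\, f(w) \qquad \text{for all } w .
\]

% Check on the scalar logistic loss \ell(z) = \log(1 + e^{-z}).
% Let p = e^{-z} / (1 + e^{-z}) \in (0, 1), so that \ell'(z) = -p and 1 - p = 1 / (1 + e^{-z}).
\[
  \ell''(z) \;=\; p\,(1 - p) \;\le\; p \;\le\; -\log(1 - p) \;=\; \ell(z) .
\]
% Hence the scalar logistic loss satisfies the assumed form with L_0 = 0 and L_1 = 1.
% A uniform bound would only give \ell''(z) \le 1/4, which ignores that the
% curvature vanishes as the loss goes to zero.
```

The second line is the reason a condition of this kind is informative for separable problems: as the loss decreases, the allowed curvature shrinks with it, so larger effective steps remain safe.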

📝 Abstract

Recent work has analyzed the convergence of first-order methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. We generalize this assumption to objectives whose curvature is an affine function of the objective value. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD. Furthermore, we show that for a class of two-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step-size and momentum parameter. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy-ball momentum.
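
The claim that sign GD converges linearly on separable logistic regression while GD is slower can be illustrated with a small numerical sketch. The code below is not the paper's experiment: the four-point dataset, the step sizes, and the helper names (`loss`, `grad`, `run`) are hypothetical choices made here. The data are picked so that the gradient keeps the same sign in every coordinate, which makes the geometric decay of the loss under sign GD easy to see.

```python
import math

# Toy linearly separable dataset (hypothetical, for illustration only).
X = [(2.0, 1.0), (1.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
Y = [1.0, 1.0, -1.0, -1.0]


def margins(w):
    # Signed margins y_i * <w, x_i>.
    return [y * (w[0] * x[0] + w[1] * x[1]) for x, y in zip(X, Y)]


def loss(w):
    # Mean logistic loss log(1 + exp(-m)), computed in a numerically stable way.
    return sum(max(-m, 0.0) + math.log1p(math.exp(-abs(m))) for m in margins(w)) / len(X)


def sigmoid_neg(m):
    # sigma(-m) = 1 / (1 + exp(m)), written to avoid overflow for large |m|.
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))


def grad(w):
    # Gradient of the mean logistic loss with respect to w.
    g = [0.0, 0.0]
    for x, y, m in zip(X, Y, margins(w)):
        c = -y * sigmoid_neg(m) / len(X)
        g[0] += c * x[0]
        g[1] += c * x[1]
    return g


def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)


def run(direction, eta, iters):
    # Generic loop: w <- w - eta * direction(grad(w)), starting from w = 0.
    w = [0.0, 0.0]
    history = [loss(w)]
    for _ in range(iters):
        d = direction(grad(w))
        w = [wi - eta * di for wi, di in zip(w, d)]
        history.append(loss(w))
    return history


# Constant step sizes; the values 1.0 and 0.1 are arbitrary illustrative choices.
gd_hist = run(lambda g: g, eta=1.0, iters=200)
sign_hist = run(lambda g: [sign(v) for v in g], eta=0.1, iters=200)

print("GD final loss:     ", gd_hist[-1])
print("sign GD final loss:", sign_hist[-1])
print("sign GD per-step loss ratio:", sign_hist[150] / sign_hist[149])
```

In this toy setting every margin grows by 0.3 per sign GD step, so the loss eventually shrinks by a constant factor of about exp(-0.3) per iteration, which is what a linear rate means. GD, by contrast, takes steps proportional to a gradient that vanishes with the loss, so its loss decays only polynomially in the iteration count here.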

Problem

Research questions and friction points this paper is trying to address.

non-uniform smoothness

convergence

first-order methods

gradient domination

optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

non-uniform smoothness

affine curvature

linear convergence