Flatland: The Adventures of Gradient Descent with Large Step Sizes

📅 2026-06-04
🤖 AI Summary
This work addresses how to choose step sizes for gradient descent when the neural network loss is not globally $L$-smooth. It introduces a unified notion of a "large" step size that relies only on local Lipschitz (or Hölder) continuity of the gradient, and it proposes adaptive first-order methods that provably yield such step sizes. These methods operate at the Edge of Stability (EoS) from the start of training: the product of the step size and the sharpness (the largest eigenvalue of the Hessian) stays above the threshold of 2, and the loss decreases non-monotonically. The authors also report that the method can drive the sharpness down to its global minimum, and that entering globally flat regions too early can harm both convergence and generalization. Using a self-stabilization argument, they let GD enter slightly sharper valleys, which turns unsuccessful training runs into successful ones.
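The threshold of 2 can be illustrated on a one-dimensional quadratic, where the sharpness is a constant. The sketch below is not the paper's method; the function name `gd_on_quadratic` and all parameter values are assumptions chosen for illustration. On $f(x) = \tfrac{1}{2} a x^2$, each GD step multiplies $x$ by $(1 - \eta a)$, so the iterates shrink when $\eta a < 2$ and grow when $\eta a > 2$.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, num_steps=50):
    """Run plain GD on f(x) = 0.5 * sharpness * x**2 and return all iterates.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    xs = [x0]
    x = x0
    for _ in range(num_steps):
        grad = sharpness * x        # f'(x) = sharpness * x
        x = x - step_size * grad    # equivalent to x *= (1 - step_size * sharpness)
        xs.append(x)
    return xs


sharpness = 4.0
below = gd_on_quadratic(sharpness, step_size=0.45)  # step_size * sharpness = 1.8
above = gd_on_quadratic(sharpness, step_size=0.55)  # step_size * sharpness = 2.2

print("product 1.8, final |x|:", abs(below[-1]))
print("product 2.2, final |x|:", abs(above[-1]))
```

On a pure quadratic, a product above 2 simply diverges. The claim in the summary concerns neural network losses, where the sharpness changes along the trajectory, so this toy example shows only where the threshold comes from and not the behaviour reported in the paper.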
📝 Abstract
The training of neural networks often entails objective functions that are not globally $L$-smooth. For these functions, it is both theoretically and practically difficult to answer the question: what is the largest possible step size that ensures the convergence of gradient descent (GD)? We address this longstanding open question in deep learning by providing a unifying definition of "large" step sizes that requires only local Lipschitz (or even Hölder) continuity of the gradient. We design first-order adaptive methods that provably yield large step sizes and show that they operate at the edge of stability (EoS) right from the start of the training. In particular, the loss decreases nonmonotonically and the product between the step size and sharpness, i.e., the largest eigenvalue of the Hessian, stays above the EoS threshold of 2 throughout training. Using our method, we are also able to minimize the sharpness all the way down to its global minimum. Contrary to expectation, we find that encountering globally flat regions too early in the training may both slow down convergence and jeopardize the generalization ability of the network. Exploiting a self-stabilization argument, we allow GD to enter slightly sharper valleys and turn unsuccessful training runs into very successful ones.
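To make "local Lipschitz continuity of the gradient" concrete, here is a minimal sketch of a step size derived from a local estimate rather than a global constant. It uses a generic secant estimate of the local Lipschitz constant, which is an assumption made for illustration: it is not the step-size rule from the paper and it does not attempt to reproduce EoS behaviour. The objective $f(x) = x^4/4$ and the names `local_lipschitz_estimate`, `adaptive_gd`, `first_step`, and `scale` are likewise hypothetical.

```python
def grad(x):
    # f(x) = x**4 / 4, so f'(x) = x**3.
    # f''(x) = 3 * x**2 is unbounded: the gradient is locally, not globally, Lipschitz.
    return x ** 3


def local_lipschitz_estimate(x_prev, x_curr):
    """Secant estimate |f'(x_curr) - f'(x_prev)| / |x_curr - x_prev|."""
    return abs(grad(x_curr) - grad(x_prev)) / abs(x_curr - x_prev)


def adaptive_gd(x0, num_steps=100, first_step=1e-3, scale=1.0):
    """GD whose step size is scale / (local Lipschitz estimate).

    Illustrative only: a generic secant-based rule, not the paper's method.
    """
    x_prev = x0
    x = x0 - first_step * grad(x0)  # small initial step to obtain two distinct points
    for _ in range(num_steps):
        if x == x_prev:
            break                   # no movement, so the estimate is undefined
        lipschitz = local_lipschitz_estimate(x_prev, x)
        if lipschitz == 0.0:
            break                   # gradient did not change; stop instead of dividing by zero
        step = scale / lipschitz
        x_prev, x = x, x - step * grad(x)
    return x


result = adaptive_gd(2.0)
print("final iterate:", result)
```

No single fixed step size is safe for this objective from every starting point, because the curvature grows without bound as $|x|$ grows. The step above adapts because it depends only on the gradient change between the two most recent iterates.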
Problem

Research questions and friction points this paper is trying to address.

gradient descent
large step size
non-smooth optimization
edge of stability
neural network training
Innovation

Methods, ideas, or system contributions that make the work stand out.

large step size
edge of stability
sharpness minimization
adaptive gradient descent
non-monotonic convergence