🤖 AI Summary
This paper investigates the theoretical underpinnings of Adam’s superior empirical performance over SGD in language model training. While conventional analyses rely on ℓ₂-smoothness, such assumptions fail to capture observed gradient heterogeneity and sparsity across coordinates.
Method: We identify and empirically validate an ℓ∞-geometric structure in the loss landscape—characterized by heterogeneous and sparse coordinate-wise gradient variations—and develop the first convergence-analysis framework for adaptive optimization based on ℓ∞-smoothness.
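For intuition, smoothness with respect to a general norm is commonly formalized via the dual norm; a sketch of how this reads under ℓ∞-geometry is below (the paper's exact assumption may differ in details):

```latex
% L-smoothness w.r.t. a norm \|\cdot\| is usually stated via its dual \|\cdot\|_*:
%   \|\nabla f(x) - \nabla f(y)\|_* \le L\,\|x - y\|.
% The dual of \ell_\infty is \ell_1, so \ell_\infty-smoothness reads
\[
  \|\nabla f(x) - \nabla f(y)\|_1 \;\le\; L_\infty\,\|x - y\|_\infty,
\]
% which implies the corresponding descent-lemma bound
\[
  f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x\rangle
        + \tfrac{L_\infty}{2}\,\|y - x\|_\infty^2.
\]
```

Heterogeneous, sparse coordinate-wise gradients can make the empirical $L_\infty$ much smaller than the ℓ₂-smoothness constant, which is the quantity the analysis exploits.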
Contribution/Results: We prove that Adam's coordinate-wise adaptive step sizes exploit this structure, yielding convergence bounds with a far smaller empirical smoothness constant than ℓ₂-based analyses, whereas SGD cannot exploit it and remains robust but slower. We further extend the analysis to a blockwise Adam variant and verify the theory on GPT-2 and ResNet: artificially disrupting the ℓ∞-geometry sharply degrades Adam's performance while SGD remains stable, confirming this geometry as the key mechanism behind Adam's advantage.
📝 Abstract
Adam outperforms SGD when training language models. Yet this advantage is not well understood theoretically -- previous convergence analyses for Adam and SGD mainly focus on the number of steps $T$ and are already minimax-optimal in the non-convex setting, with both rates being $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under the novel assumption that the loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed, while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
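To make the "coordinate-wise adaptive step size" mechanism concrete, here is a minimal NumPy sketch of a standard Adam step alongside a hypothetical blockwise variant that shares one second-moment scale per parameter block; the `blockwise_adam_step` helper and its `blocks` argument are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam: each coordinate i gets its own effective step size
    # lr / (sqrt(v_hat[i]) + eps), which adapts to heterogeneous gradients.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)          # bias correction
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

def blockwise_adam_step(param, grad, m, v, t, blocks,
                        lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Hypothetical blockwise variant: instead of one adaptive scale per
    # coordinate, average the bias-corrected second moment within each
    # block, giving one shared step size per block.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    for idx in blocks:               # idx: list of coordinate indices
        v_hat[idx] = v_hat[idx].mean()
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

With a single block covering all coordinates, the blockwise update degenerates toward a globally rescaled SGD-with-momentum step, which is one way to interpolate between Adam's per-coordinate adaptivity and SGD's geometry-insensitive behavior.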