Why Muon Outperforms Adam: A Curvature Perspective

📅 2026-06-03

🤖 AI Summary
This work investigates why the Muon optimizer outperforms Adam in large language model training, focusing on how each optimizer navigates local loss geometry. Using a second-order Taylor expansion of the loss, the study attributes Muon's gains to lower Normalized Directional Sharpness (NDS) and introduces a decomposition of the curvature penalty into the squared update norm and NDS. Empirically, Muon achieves a larger per-step loss reduction than Adam at matched validation loss: the two optimizers have comparable first-order gains and update norms, but Muon incurs a smaller curvature penalty. Theoretically, on stylized quadratic problems with heterogeneous curvature, Muon attains a lower average NDS than gradient descent by distributing update energy more evenly across curvature modes, and a lower local quadratic loss when curvature heterogeneity is sufficiently strong. The NDS advantage is amplified by data imbalance and, in the middle and late stages of training, is sustained mainly by smaller within-layer curvature. The methodology combines Zipf-PCFG synthetic data, within- and cross-layer curvature analysis, and theory for quadratic problems with heterogeneous curvature.
📝 Abstract
Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
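The decomposition described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's code or experimental setup. It assumes a matrix quadratic loss 0.5 * tr(W^T A W) with a diagonal curvature matrix A, assumes NDS is defined as the update's quadratic form divided by its squared Frobenius norm, and replaces Muon's momentum and Newton-Schulz iteration with an exact SVD-based orthogonalization of the raw gradient. All function and variable names are hypothetical. Under these assumptions the second-order term equals 0.5 * ||D||^2 * NDS(D), and the orthogonalized update spreads its energy evenly across curvature modes, whereas the plain gradient step concentrates on the sharpest one.

```python
import numpy as np

def loss(A, W):
    """Toy quadratic loss 0.5 * tr(W^T A W); A plays the role of the Hessian."""
    return 0.5 * float(np.trace(W.T @ A @ W))

def nds(A, D):
    """Assumed NDS definition: tr(D^T A D) / ||D||_F^2 (invariant to the scale of D)."""
    return float(np.trace(D.T @ A @ D) / np.sum(D * D))

def orthogonalized(G):
    """Polar factor U V^T of G via SVD: a stand-in for Muon's Newton-Schulz step."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def predicted_change(G, A, D):
    """Second-order Taylor prediction, split into first-order gain and curvature penalty."""
    first_order = float(np.sum(G * D))                # <G, D>
    penalty = 0.5 * float(np.sum(D * D)) * nds(A, D)  # 0.5 * ||D||^2 * NDS
    return first_order + penalty, first_order, penalty

# Heterogeneous curvature: one sharp mode, three flat modes.
h = np.array([100.0, 1.0, 1.0, 1.0])
A = np.diag(h)
W = np.eye(4)
G = A @ W          # gradient of the toy loss; dominated by the sharp mode

lr = 0.01
D_gd = -lr * G                      # plain gradient step
D_muon = -lr * orthogonalized(G)    # Muon-like step with equal singular values

nds_gd = nds(A, D_gd)
nds_muon = nds(A, D_muon)
total, first_order, penalty = predicted_change(G, A, D_muon)
actual = loss(A, W + D_muon) - loss(A, W)

print(f"NDS (GD step):        {nds_gd:.4f}")
print(f"NDS (Muon-like step): {nds_muon:.4f}")
print(f"predicted change {total:.6f} = {first_order:.6f} + {penalty:.6f}")
print(f"actual change    {actual:.6f}")
```

Because the toy loss is exactly quadratic, the predicted and actual loss changes agree up to floating-point error. The gradient step's NDS sits near the largest curvature value, while the orthogonalized step's NDS equals the mean of the curvature values. The two steps have different norms here, which does not affect the NDS comparison because NDS is scale-invariant.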
Problem

Research questions and friction points this paper is trying to address.

curvature
optimization
large language models
training efficiency
Normalized Directional Sharpness
Innovation

Methods, ideas, or system contributions that make the work stand out.

curvature
Normalized Directional Sharpness
optimizer comparison
second-order analysis
training dynamics