OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

📅 2026-06-07
🤖 AI Summary
This work addresses a limitation of existing orthogonalized momentum methods: because they rely on constant or open-loop stepsize schedules, they struggle to be both robust to noise and rate-optimal when noise is absent. The paper proposes OptMuon, a family of methods that combines closed-loop adaptivity with momentum orthogonalization. It builds orthogonalized momentum directions from the polar factor and sets the update magnitude with an AdaGrad-Norm-style coefficient schedule driven by the observed gradient and momentum history, so no prior knowledge of the smoothness constant or noise level is required. OptMuon-A achieves $\widetilde{O}(T^{-1/2} + \sigma^{1/2}T^{-1/4})$ under average smoothness, and OptMuon-I achieves $\widetilde{O}(T^{-1/2} + \sigma^{1/3}T^{-1/3})$ under individual smoothness. Both bounds reduce to the nearly optimal deterministic rate $\widetilde{O}(T^{-1/2})$ when the noise vanishes.
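As a rough reading of the two rates, the short calculation below finds the noise level at which the stochastic term stops dominating the deterministic term. It treats the bounds as exact, ignoring the constants and logarithmic factors hidden in $\widetilde{O}$, so it is an order-of-magnitude guide under that assumption and not a statement from the paper.

```latex
% Average smoothness: when is the noise term no larger than T^{-1/2}?
\sigma^{1/2} T^{-1/4} \le T^{-1/2}
  \iff \sigma^{1/2} \le T^{-1/4}
  \iff \sigma \le T^{-1/2}.

% Individual smoothness: same question for the second bound.
\sigma^{1/3} T^{-1/3} \le T^{-1/2}
  \iff \sigma^{1/3} \le T^{-1/6}
  \iff \sigma \le T^{-1/2}.
```

In both cases the deterministic rate is recovered once $\sigma \lesssim T^{-1/2}$, with $\sigma = 0$ as the limiting case. For a fixed $\sigma > 0$ and large $T$, the individual-smoothness term $T^{-1/3}$ decays faster than the average-smoothness term $T^{-1/4}$.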
📝 Abstract
Orthogonalized momentum updates, as used in Muon-style optimizers, have recently shown strong empirical stability in large-scale deep learning. However, existing orthogonalized methods are typically paired with constant or open-loop magnitude rules, and therefore do not explicitly calibrate their update magnitudes from the observed optimization trajectory. Motivated by the closed-loop perspective behind Lipschitz-free and noise-adaptive methods, we propose OptMuon, a family of adaptive momentum orthogonalization methods for stochastic nonconvex optimization. OptMuon combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type coefficient schedule, so that the update magnitude is determined by the observed gradient and momentum history rather than by a prescribed Lipschitz-dependent rule. The schedule does not use the smoothness constant, the variance level, or the bounded-gradient constant in parameter selection, and its running-maximum correction prevents isolated gradient spikes from causing excessive coefficient collapse. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, we prove two complementary guarantees. OptMuon-A achieves the noise-adaptive rate \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/2}T^{-1/4})\) under average smoothness, while OptMuon-I achieves \(\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/3}T^{-1/3})\) under individual smoothness. In the zero-noise regime, both bounds automatically reduce to a nearly optimal deterministic first-order rate \(\tilde{\mathcal O}(T^{-1/2})\) without manual hyperparameter retuning. These results show that closed-loop scalar adaptation can be combined with Muon-style momentum orthogonalization while retaining noise adaptivity and zero-noise optimality up to logarithmic factors.
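To make the two ingredients named in the abstract concrete, here is a minimal NumPy sketch of a polar-factor direction scaled by a plain AdaGrad-Norm coefficient. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not give OptMuon's exact schedule, and the running-maximum correction is omitted because its form is not specified here. The function names, the momentum rule, and the parameters `beta`, `eta`, and `b0` are all hypothetical.

```python
import numpy as np


def polar_factor(m):
    """Return U @ V^T from the reduced SVD m = U S V^T.

    For a full-rank matrix this is the orthogonal polar factor:
    every singular value of the result equals 1.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def sketch_step(w, m, grad, grad_sq_sum, beta=0.9, eta=0.1, b0=1e-8):
    """One illustrative update (NOT the paper's exact rule).

    w, m        : parameter matrix and momentum buffer (same shape)
    grad        : stochastic gradient at w
    grad_sq_sum : running sum of squared gradient Frobenius norms
    beta, eta, b0 : hypothetical momentum, scale, and offset parameters
    """
    # Exponential moving average of gradients (assumed momentum rule).
    m = beta * m + (1.0 - beta) * grad
    # Plain AdaGrad-Norm accumulator: it uses only observed gradients,
    # with no smoothness constant or noise level as input.
    grad_sq_sum = grad_sq_sum + float(np.sum(grad ** 2))
    coeff = eta / np.sqrt(b0 ** 2 + grad_sq_sum)
    # Direction comes from the polar factor, magnitude from the scalar.
    w = w - coeff * polar_factor(m)
    return w, m, grad_sq_sum
```

The direction has unit spectral norm whenever the momentum is full rank, so the scalar coefficient alone sets the update magnitude. In this simplified accumulator the coefficient can only shrink over time, which is why a single large gradient can depress it sharply; the abstract attributes the protection against that collapse to the running-maximum correction left out of this sketch.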
Problem

Research questions and friction points this paper is trying to address.

orthogonalized momentum
stochastic optimization
closed-loop adaptation
noise adaptivity
zero-noise optimality
Innovation

Methods, ideas, or system contributions that make the work stand out.

closed-loop adaptation
momentum orthogonalization
noise-adaptive optimization
zero-noise optimality
stochastic nonconvex optimization