OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing orthogonalized momentum methods rely on constant or open-loop stepsize schedules, which makes it hard to obtain noise robustness and the optimal noise-free convergence rate at the same time. To address this, the paper proposes OptMuon, a family of methods that combines closed-loop adaptivity with momentum orthogonalization. OptMuon builds orthogonalized momentum directions from the polar factor of the momentum and scales them with an AdaGrad-Norm-style coefficient schedule driven by the observed gradient and momentum history, so stepsize selection needs no prior knowledge of the smoothness constant, the noise level, or the bounded-gradient constant. OptMuon-A achieves a convergence rate of $\tilde{\mathcal O}(T^{-1/2} + \sigma^{1/2}T^{-1/4})$ under average smoothness, and OptMuon-I achieves $\tilde{\mathcal O}(T^{-1/2} + \sigma^{1/3}T^{-1/3})$ under individual smoothness. When the noise vanishes ($\sigma = 0$), both bounds reduce to the nearly optimal deterministic rate $\tilde{\mathcal O}(T^{-1/2})$ without retuning.
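The two ingredients named in the summary, a polar-factor direction and an AdaGrad-Norm-style coefficient, can be illustrated with a minimal pure-Python sketch. This is not the paper's algorithm. The function names, the defaults `eta0`, `beta`, and `b0`, the momentum form, and the use of a cubic Newton-Schulz iteration to approximate the polar factor are all assumptions made for illustration. The coefficient here depends only on accumulated gradient norms, and the running-maximum correction mentioned in the abstract is not reproduced because this page does not specify it.

```python
import math

def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_norm(a):
    return math.sqrt(sum(x * x for row in a for x in row))

def polar_factor(m, iters=40):
    """Approximate the orthogonal polar factor of m (assumed approach:
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X)."""
    scale = frobenius_norm(m)
    if scale == 0.0:
        return [row[:] for row in m]  # zero matrix: no direction to extract
    # Dividing by the Frobenius norm puts all singular values in [0, 1],
    # inside the convergence region of the iteration.
    x = [[v / scale for v in row] for row in m]
    for _ in range(iters):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rx, rq)] for rx, rq in zip(x, xxtx)]
    return x

def init_state(shape, b0=1e-8):
    """Hypothetical optimizer state: zero momentum and a small positive accumulator."""
    rows, cols = shape
    return {"momentum": [[0.0] * cols for _ in range(rows)], "acc": b0 * b0}

def closed_loop_step(w, grad, state, eta0=0.1, beta=0.9):
    """One illustrative step: momentum, orthogonalize, AdaGrad-Norm-style scaling."""
    state["momentum"] = [[beta * m + (1.0 - beta) * g for m, g in zip(rm, rg)]
                         for rm, rg in zip(state["momentum"], grad)]
    # Closed loop: the coefficient is computed from observed gradient norms,
    # not from a smoothness or noise constant supplied in advance.
    state["acc"] += frobenius_norm(grad) ** 2
    coeff = eta0 / math.sqrt(state["acc"])
    direction = polar_factor(state["momentum"])
    new_w = [[wi - coeff * di for wi, di in zip(rw, rd)] for rw, rd in zip(w, direction)]
    return new_w, coeff

# Usage on f(W) = 0.5 * ||W||_F^2, whose gradient is W itself.
w = [[3.0, 1.0], [1.0, 2.0]]
state = init_state((2, 2))
for _ in range(3):
    w, coeff = closed_loop_step(w, w, state)
```

Because the accumulator only grows, the coefficient in this sketch is non-increasing along the trajectory, which is the basic AdaGrad-Norm behavior. In practice the polar factor is usually computed with a library routine or a tuned Newton-Schulz variant rather than hand-written list arithmetic.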

📝 Abstract

Orthogonalized momentum updates, as used in Muon-style optimizers, have recently shown strong empirical stability in large-scale deep learning. However, existing orthogonalized methods are typically paired with constant or open-loop magnitude rules, and therefore do not explicitly calibrate their update magnitudes from the observed optimization trajectory. Motivated by the closed-loop perspective behind Lipschitz-free and noise-adaptive methods, we propose OptMuon, a family of adaptive momentum orthogonalization methods for stochastic nonconvex optimization. OptMuon combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type coefficient schedule, so that the update magnitude is determined by the observed gradient and momentum history rather than by a prescribed Lipschitz-dependent rule. The schedule does not use the smoothness constant, the variance level, or the bounded-gradient constant in parameter selection, and its running-maximum correction prevents isolated gradient spikes from causing excessive coefficient collapse. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, we prove two complementary guarantees. OptMuon-A achieves the noise-adaptive rate $\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/2}T^{-1/4})$ under average smoothness, while OptMuon-I achieves $\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/3}T^{-1/3})$ under individual smoothness. In the zero-noise regime, both bounds automatically reduce to a nearly optimal deterministic first-order rate $\tilde{\mathcal O}(T^{-1/2})$ without manual hyperparameter retuning. These results show that closed-loop scalar adaptation can be combined with Muon-style momentum orthogonalization while retaining noise adaptivity and zero-noise optimality up to logarithmic factors.
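As a short worked comparison of the two stated bounds, ignoring constants and logarithmic factors (an informal reading of the rates, not a result taken from the paper), the deterministic term dominates each bound exactly when the noise level is below the same threshold:

$$
T^{-1/2} \ge \sigma^{1/2} T^{-1/4} \iff \sigma \le T^{-1/2},
\qquad
T^{-1/2} \ge \sigma^{1/3} T^{-1/3} \iff \sigma \le T^{-1/2}.
$$

Setting $\sigma = 0$ removes the noise term from both bounds and leaves $\tilde{\mathcal O}(T^{-1/2})$. In the opposite regime the two noise terms compare as

$$
\sigma^{1/3} T^{-1/3} \le \sigma^{1/2} T^{-1/4} \iff \sigma \ge T^{-1/2},
$$

so once noise dominates, the bound under individual smoothness is the smaller of the two.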

Problem

Research questions and friction points this paper is trying to address.

orthogonalized momentum

stochastic optimization

closed-loop adaptation

noise adaptivity

zero-noise optimality

Innovation

Methods, ideas, or system contributions that make the work stand out.

closed-loop adaptation

momentum orthogonalization

noise-adaptive optimization