🤖 AI Summary
This work addresses the high computational cost of large language model pretraining and the limited convergence efficiency of existing optimizers. We propose two variance-adaptive variants of the Muon optimizer, Muon-NSR and Muon-VS, which incorporate, respectively, a noise-to-signal ratio (NSR)-modulated mechanism and a hyperparameter-free variance-scaled normalization applied prior to momentum orthogonalization. To our knowledge, this is the first integration of variance-adaptive principles into the Muon framework. Experiments on GPT-2 and LLaMA pretraining show that both variants significantly outperform well-tuned AdamW and the original Muon. Notably, Muon-VS reduces the number of iterations required to reach the target validation loss by a factor of 1.36 on LLaMA-1.2B.
📝 Abstract
Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks, yet pretraining is computationally demanding, making optimizer efficiency an important practical consideration. Muon accelerates LLM pretraining via orthogonal momentum updates that serve as a matrix analogue of the element-wise sign operator. Motivated by the recent perspective that Adam is a variance-adaptive sign update algorithm, we propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum before orthogonalization. Muon-NSR applies noise-to-signal ratio (NSR) modulation, while Muon-VS performs variance-based scaling without introducing additional hyperparameters. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than well-tuned AdamW and Muon baselines. For example, on the LLaMA-1.2B model, Muon-NSR and Muon-VS reduce the iterations required to reach the target validation loss by $1.36\times$ relative to well-tuned Muon under a recent benchmark setup.
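To make the high-level description concrete, below is a minimal PyTorch-style sketch of the structure the abstract describes: momentum is normalized by a running variance estimate before the Newton-Schulz orthogonalization used by Muon. The specific scaling rule, hyperparameter values, and helper names (`newton_schulz_orthogonalize`, `variance_adaptive_muon_step`) are illustrative assumptions, not the paper's exact formulation of Muon-NSR or Muon-VS.

```python
# Illustrative sketch only: the exact NSR / variance-scaling formulas are not
# given in the abstract, so the normalization step below is an assumption meant
# to convey the overall structure (variance-adaptive normalization of momentum,
# followed by Muon-style orthogonalization of the 2D weight update).
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D momentum matrix M (matrix analogue of
    the sign operator), via the quintic Newton-Schulz iteration used in public
    Muon implementations."""
    X = M / (M.norm() + eps)
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon code
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def variance_adaptive_muon_step(param, grad, state,
                                lr=0.02, beta=0.95, beta2=0.99, eps=1e-8):
    """One hypothetical variance-adaptive Muon step: scale momentum by a running
    second-moment estimate before orthogonalizing, then apply the update."""
    m = state.setdefault("momentum", torch.zeros_like(grad))
    v = state.setdefault("second_moment", torch.zeros_like(grad))
    m.mul_(beta).add_(grad)                              # momentum accumulation
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # running variance estimate
    m_scaled = m / (v.sqrt() + eps)                      # assumed variance-based scaling
    update = newton_schulz_orthogonalize(m_scaled)
    param.data.add_(update, alpha=-lr)
```

The key design point the sketch highlights is ordering: the variance-adaptive normalization is applied to the momentum matrix first, and only the already-scaled matrix is orthogonalized, so the orthogonal update direction reflects per-entry noise levels rather than raw momentum magnitudes.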