🤖 AI Summary
This work addresses the high computational cost of large language model pretraining and the limited convergence efficiency of existing optimizers. We propose two variance-adaptive variants of the Muon optimizer, Muon-NSR and Muon-VS, which incorporate, respectively, a noise-to-signal ratio (NSR)-modulated mechanism and a hyperparameter-free variance-scaled normalization applied prior to momentum orthogonalization. To our knowledge, this is the first integration of variance-adaptive principles into the Muon framework. Experiments on GPT-2 and LLaMA pretraining show that both variants significantly outperform well-tuned AdamW and the original Muon. Notably, Muon-VS reduces the number of iterations required to reach the target validation loss by a factor of 1.36 on LLaMA-1.2B.
📝 Abstract
Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks, yet pretraining is computationally demanding, making optimizer efficiency an important practical consideration. Muon accelerates LLM pretraining via orthogonal momentum updates that serve as a matrix analogue of the element-wise sign operator. Motivated by the recent perspective that Adam is a variance-adaptive sign update algorithm, we propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum before orthogonalization. Muon-NSR applies noise-to-signal ratio (NSR) modulation, while Muon-VS performs variance-based scaling without introducing additional hyperparameters. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than well-tuned AdamW and Muon baselines. For example, on the LLaMA-1.2B model, Muon-NSR and Muon-VS reduce the iterations required to reach the target validation loss by $1.36\times$ relative to well-tuned Muon under a recent benchmark setup.
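To make the high-level description concrete, below is a minimal PyTorch-style sketch of the structure the abstract describes: momentum is normalized by a running variance estimate before the Newton-Schulz orthogonalization used by Muon. The specific scaling rule, hyperparameter values, and helper names (`newton_schulz_orthogonalize`, `variance_adaptive_muon_step`) are illustrative assumptions, not the paper's exact formulation of Muon-NSR or Muon-VS.

```python
# Illustrative sketch only: the exact NSR / variance-scaling formulas are not
# given in the abstract, so the normalization step below is an assumption meant
# to convey the overall structure (variance-adaptive normalization of momentum,
# followed by Muon-style orthogonalization of the 2D weight update).
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D momentum matrix M (matrix analogue of
    the sign operator), via the quintic Newton-Schulz iteration used in public
    Muon implementations."""
    X = M / (M.norm() + eps)
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon code
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def variance_adaptive_muon_step(param, grad, state,
                                lr=0.02, beta=0.95, beta2=0.99, eps=1e-8):
    """One hypothetical variance-adaptive Muon step: scale momentum by a running
    second-moment estimate before orthogonalizing, then apply the update."""
    m = state.setdefault("momentum", torch.zeros_like(grad))
    v = state.setdefault("second_moment", torch.zeros_like(grad))
    m.mul_(beta).add_(grad)                              # momentum accumulation
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # running variance estimate
    m_scaled = m / (v.sqrt() + eps)                      # assumed variance-based scaling
    update = newton_schulz_orthogonalize(m_scaled)
    param.data.add_(update, alpha=-lr)
```

The key design point the sketch highlights is ordering: the variance-adaptive normalization is applied to the momentum matrix first, and only the already-scaled matrix is orthogonalized, so the orthogonal update direction reflects per-entry noise levels rather than raw momentum magnitudes.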