Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of large language model pretraining and the limited convergence efficiency of existing optimizers. We propose two variance-adaptive variants of the Muon optimizer, Muon-NSR and Muon-VS, which incorporate, respectively, a noise-to-signal ratio (NSR)-modulated mechanism and a hyperparameter-free variance-scaled normalization applied prior to momentum orthogonalization. To our knowledge, this is the first integration of variance-adaptive principles into the Muon framework. Experimental results on GPT-2 and LLaMA pretraining demonstrate that both variants significantly outperform well-tuned AdamW and the original Muon. Notably, Muon-VS reduces the number of iterations required to reach the target validation loss by a factor of 1.36 on LLaMA-1.2B.

📝 Abstract
Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks, yet pretraining is computationally demanding, making optimizer efficiency an important practical consideration. Muon accelerates LLM pretraining via orthogonal momentum updates that serve as a matrix analogue of the element-wise sign operator. Motivated by the recent perspective that Adam is a variance-adaptive sign update algorithm, we propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum before orthogonalization. Muon-NSR applies noise-to-signal ratio (NSR) modulation, while Muon-VS performs variance-based scaling without introducing additional hyperparameters. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than both competitive, well-tuned AdamW and Muon baselines. For example, on the LLaMA-1.2B model, Muon-NSR and Muon-VS reduce the iterations required to reach the target validation loss by $1.36\times$ relative to well-tuned Muon under the setup of the recent benchmark.
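The abstract's core idea, normalizing momentum by a variance estimate before Muon's orthogonalization step, can be sketched as follows. This is an illustrative sketch only: the paper's exact NSR formula is not given here, so the Adam-style element-wise `m / (sqrt(v) + eps)` normalization, the hyperparameter values, and the function names are all assumptions. Muon itself uses a Newton-Schulz iteration for orthogonalization; SVD is used below as an exact stand-in for the same polar factor.

```python
import numpy as np

def orthogonalize(M):
    # Polar-factor orthogonalization via SVD. Muon approximates this
    # UV^T factor with a Newton-Schulz iteration; SVD computes it exactly.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_nsr_step(W, grad, m, v, lr=0.02, beta1=0.95, beta2=0.99, eps=1e-8):
    """One hypothetical Muon-NSR-style update (sketch, not the paper's code).

    W:    weight matrix being updated
    m, v: running first/second moments of the gradient (Adam-style);
          treating NSR modulation as m / (sqrt(v) + eps) is an assumption.
    """
    # update running moments of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # variance-adaptive (noise-to-signal) normalization of momentum,
    # applied BEFORE orthogonalization, per the abstract
    m_hat = m / (np.sqrt(v) + eps)
    # orthogonalized update, as in Muon
    W = W - lr * orthogonalize(m_hat)
    return W, m, v
```

The key structural point from the abstract survives any choice of normalization: the variance scaling acts element-wise on the momentum, and only the result is orthogonalized, so the update direction remains (approximately) orthogonal as in the original Muon.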
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
pretraining
optimizer efficiency
convergence acceleration
validation loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

variance-adaptive
orthogonal momentum
noise-to-signal ratio
momentum scaling
LLM pretraining
Jingru Li
College of Artificial Intelligence, Nankai University
Yibo Fan
Professor, Fudan University
Video Coding, Image Processing, Processor, VLSI Design
Huan Li
College of Artificial Intelligence, Nankai University