REG: A Regularization Optimizer for Robust Training Dynamics

📅 2025-10-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
The Muon optimizer relies on the matrix sign function, leading to training instability and incompatibility with fine-tuning models pretrained via AdamW. This work proposes REG, a novel optimizer whose core innovation replaces the sign function with a Row-and-Column-Scaling (RACS) operator grounded in matrix balancing theory. RACS enables structure-aware gradient regularization while preserving update equilibrium and reducing regularization intensity. Consequently, REG achieves superior compatibility with the AdamW paradigm and mitigates performance degradation during fine-tuning. Experiments demonstrate that REG outperforms AdamW in training stability for large language models (LLMs), avoids the accuracy drop induced by Muon during fine-tuning, and exhibits enhanced robustness across diverse tasks. By unifying structural awareness with optimization stability, REG establishes a new paradigm for efficient and reliable large-model optimization.

📝 Abstract
Optimizers are crucial for the efficient training of Large Language Models (LLMs). While AdamW is the de facto standard, recent structure-aware optimizers like Muon have emerged, which regularize gradient updates by operating on entire weight matrices. The Muon optimizer balances the gradient updates along all directions. However, Muon's reliance on the matrix sign function can lead to training instability and exhibits incompatibility when fine-tuning models pre-trained with AdamW. To address these limitations, we propose REG, a novel optimizer that replaces Muon's aggressive matrix sign operator with the Row-and-Column-Scaling (RACS) operator. Theoretically grounded in matrix balancing, the RACS operator regularizes the update steps in a less drastic manner, making it simpler to implement and more compatible with established training dynamics. Through extensive empirical experiments on LLM training, we demonstrate that our REG optimizer not only achieves superior performance and stability over AdamW, but also maintains consistency with the AdamW training paradigm. This consistency is particularly evident during the fine-tuning stage, where the REG optimizer avoids the performance degradation observed with Muon.
Problem

Research questions and friction points this paper is trying to address.

Addresses training instability in matrix-based optimizers
Resolves incompatibility with AdamW during fine-tuning
Improves optimization stability for large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces matrix sign with Row-and-Column-Scaling operator
Regularizes updates in less drastic balancing manner
Maintains compatibility with AdamW training paradigm
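The paper itself does not spell out the RACS operator on this page, but its description ("balancing a matrix" by row-and-column scaling, less drastic than the matrix sign) suggests a Sinkhorn-style normalization of the gradient. The sketch below is an illustrative guess along those lines: the function name `racs`, the use of L2 norms, the iteration count, and the final magnitude rescaling are all assumptions, not the paper's actual algorithm.

```python
import numpy as np

def racs(grad, num_iters=3, eps=1e-8):
    """Hypothetical sketch of a row-and-column-scaling (RACS) step.

    Alternately rescales rows and columns of the gradient toward
    equal L2 norms, in the spirit of Sinkhorn-style matrix balancing.
    The exact operator used by REG may differ.
    """
    m = np.asarray(grad, dtype=np.float64).copy()
    for _ in range(num_iters):
        # balance rows: divide each row by its L2 norm
        m /= np.linalg.norm(m, axis=1, keepdims=True) + eps
        # balance columns: divide each column by its L2 norm
        m /= np.linalg.norm(m, axis=0, keepdims=True) + eps
    # rescale so the update keeps the original gradient's overall magnitude
    m *= np.linalg.norm(grad) / (np.linalg.norm(m) + eps)
    return m
```

Unlike the matrix sign function, which forces all singular values of the update to one, a balancing step like this only evens out per-row and per-column magnitudes, which is consistent with the paper's claim of a "less drastic" regularization that stays closer to AdamW-style dynamics.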
Zehua Liu
Huawei Noah’s Ark Lab
Han Wu
Huawei Noah’s Ark Lab
Xiaojin Fu
Huawei Noah’s Ark Lab
Shuqi Liu
Huawei Noah’s Ark Lab
Xiongwei Han
AI&OR Principal Researcher at Noah's Ark Lab, Huawei
Tao Zhong
Huawei Noah’s Ark Lab
Mingxuan Yuan
Huawei Noah’s Ark Lab