GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

career value

173K/year

🤖 AI Summary

This work addresses numerical instability in low-precision large language model training, which often stems from a small number of problematic operators. The authors formulate runtime stability as a controllable problem and propose a lightweight, backend-agnostic sparse recovery mechanism. This approach continuously monitors anomalies using the Gradient Norm-to-Mean Ratio (GNMR) and dynamically triggers recovery operations under a hard maxO budget constraint and brief locking intervals—without altering numerical formats or underlying implementations. Experiments demonstrate that the method achieves high-quality convergence across diverse scenarios, including activation quantization stress tests, DeepSeek-style training, and fine-tuning of LLaMA-2 13B, confirming its effectiveness and broad applicability.

📝 Abstract

Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit's current gradient norm with its historical mean. Together with $Δ$-GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard $\mathrm{maxO}$ budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.

Problem

Research questions and friction points this paper is trying to address.

low-precision training

training stability

numerical risk

large language models

runtime stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

GNMR

low-precision training

runtime stability control