Optimal Scaling Needs Optimal Norm

📅 2025-10-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the unified scaling laws of the optimal learning rate and batch size (η*, B*) under joint model and dataset scaling. The authors identify a "norm transfer" phenomenon and establish that the operator norm of the output-layer weight matrix serves as a key invariant governing optimal hyperparameter selection, yielding a norm-guided scaling principle. Leveraging the Scion optimizer and the Distributed Scion (Disco) training framework, they conduct over 2,000 experiments, spanning models up to 1.3B parameters and datasets up to 138B tokens, to empirically validate this norm-based condition: the output layer is the most sensitive to its learning rate, while hidden layers benefit from comparatively lower learning rates. Crucially, they provide the first measurement of how the optimal (η*, B*) pair scales with dataset size for Scion, serving as a sufficient condition on top of the constant-norm requirement. This principle improves training stability and efficiency for large language models.

📝 Abstract
Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value, a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(\eta, B)$ reach the optimal norm, only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
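The norm-transfer claim centers on the spectral (operator) norm of the output-layer weight matrix. A minimal sketch of how one might monitor this quantity during training, using NumPy; the matrix shape and initialization scale below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def operator_norm(W: np.ndarray) -> float:
    """Spectral (operator) norm of W, i.e. its largest singular value."""
    return float(np.linalg.norm(W, ord=2))

# Hypothetical output-layer weight matrix (vocab_size x hidden_dim);
# in practice this would be logged each step alongside the loss.
rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.02, size=(1000, 256))

norm_value = operator_norm(W_out)
```

Under norm transfer, runs at the optimal $(\eta^{\ast}, B^{\ast})$ would converge to the same `norm_value` across model and dataset scales, which is what makes the quantity useful as a tuning target.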
Problem

Research questions and friction points this paper is trying to address.

What invariant governs optimal hyperparameter transfer under joint model and dataset scaling?
How do optimal learning rate-batch size pairs scale with dataset size?
Can unified scaling rules be established for large language model training?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Operator-norm invariance ("norm transfer") discovered using the Scion optimizer
Norm-guided scaling of optimal (η*, B*) pairs with dataset size
Per-layer-group learning rate tuning that improves model performance
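The per-layer-group finding (the output layer is most sensitive, hidden layers prefer lower rates) can be sketched as plain SGD with group-specific learning rates. This is a simplified illustration, not the Scion update rule; the parameter names, grouping, and rates are illustrative assumptions:

```python
import numpy as np

def sgd_step(params: dict, grads: dict, lr_by_group: dict) -> dict:
    """One SGD step where each layer group gets its own learning rate."""
    for name, W in params.items():
        group = "output" if name == "w_out" else "hidden"
        W -= lr_by_group[group] * grads[name]  # in-place update
    return params

# Toy two-layer parameter set with unit gradients.
params = {"w_hidden": np.ones((4, 4)), "w_out": np.ones((4, 4))}
grads = {"w_hidden": np.ones((4, 4)), "w_out": np.ones((4, 4))}

# Hidden layers get a lower rate than the output layer (rates are made up).
lr_by_group = {"hidden": 0.01, "output": 0.05}
params = sgd_step(params, grads, lr_by_group)
```

The same grouping idea carries over to real optimizers that support per-parameter-group settings, which is how such per-layer sweeps are typically run in practice.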