AI Summary
This work addresses the high cost and instability of hyperparameter selection in continual pretraining of large language models, which typically relies on heuristics or grid search. The study reveals, for the first time, a stable scaling law between hyperparameters and compute budget in continual pretraining and introduces a model-agnostic, state-aware two-stage hyperparameter prediction framework. In the first stage, a small-scale proxy model is used to establish the loss-compute scaling relationship; in the second, the validation loss of the initial checkpoint is leveraged to infer the equivalent pretraining compute budget, enabling accurate prediction of optimal hyperparameters for the target training run. This approach reduces hyperparameter tuning overhead by up to 90% while achieving performance on par with or superior to baseline methods across diverse model architectures.
Abstract
The efficacy of continued pre-training for Large Language Models (LLMs) hinges upon hyperparameter configurations, such as learning rate and batch size. However, current practices often rely on heuristics or grid searches, leading to training instability and excessive costs. In this work, we first empirically discover that optimal hyperparameters follow stable and predictable scaling laws throughout the continued pre-training process. Leveraging these insights, we propose a novel framework to establish quantitative relationships between compute budget and optimal hyperparameters for a given checkpoint. Our approach has two stages: (1) Empirical Law Discovery, where we train small-scale proxy models to derive functions mapping compute budget to optimal hyperparameters via standard loss-compute scaling laws; and (2) State-Aware Hyperparameter Prediction, where we evaluate an initial checkpoint's validation loss and use the inverse scaling law to estimate its equivalent pre-training compute, that is, the compute needed to achieve the same loss from scratch. Combining this with the planned compute budget, we predict optimal hyperparameters for the target run. Empirical results demonstrate that our method reduces the hyperparameter search overhead by up to 90% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures, providing a principled and efficient methodology for diverse continued pre-training scenarios starting from any given point.
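The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the functional forms, so the saturating power law for loss versus compute, the power laws for learning rate and batch size, every numeric constant, and the choice to combine equivalent and planned compute by simple addition are all assumptions made here for illustration. The function names are hypothetical as well.

```python
# Illustrative sketch only. Functional forms and constants are ASSUMED,
# not taken from the paper; in practice they would be fitted on proxy runs.

# Stage 1 output (assumed form): L(C) = A * C**(-B) + L_INF
A, B, L_INF = 12.0, 0.08, 1.7

# Stage 1 output (assumed form): optimal hyperparameter = coef * C**exp
LR_COEF, LR_EXP = 0.3, -0.12   # hypothetical learning-rate law
BS_COEF, BS_EXP = 0.02, 0.25   # hypothetical batch-size law


def loss_from_compute(compute_flops: float) -> float:
    """Forward loss-compute law: predicted loss after training from scratch."""
    return A * compute_flops ** (-B) + L_INF


def equivalent_compute(val_loss: float) -> float:
    """Inverse law: compute needed to reach val_loss from scratch."""
    if val_loss <= L_INF:
        raise ValueError("validation loss is at or below the assumed loss floor")
    return ((val_loss - L_INF) / A) ** (-1.0 / B)


def predict_hyperparameters(val_loss: float, planned_flops: float) -> dict:
    """Stage 2: map checkpoint state plus planned budget to hyperparameters."""
    c_eq = equivalent_compute(val_loss)
    # ASSUMPTION: equivalent and planned compute are combined by addition.
    c_total = c_eq + planned_flops
    return {
        "equivalent_flops": c_eq,
        "learning_rate": LR_COEF * c_total ** LR_EXP,
        "batch_size": BS_COEF * c_total ** BS_EXP,
    }


# Example: a checkpoint with validation loss 2.1 and a planned budget of 5e19 FLOPs.
print(predict_hyperparameters(val_loss=2.1, planned_flops=5e19))
```

The point of the sketch is the data flow: a single validation-loss measurement on the starting checkpoint replaces knowledge of its training history, so the same prediction step applies to any checkpoint regardless of how it was produced.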