q0: Primitives for Hyper-Epoch Pretraining

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses premature saturation in single-model multi-epoch pretraining, where a model stops improving after a few passes and leaves much of the compute budget unused. The authors propose hyper-epoch pretraining (q0), a framework built on three mechanisms: a cyclic schedule with anti-correlated learning rate and weight decay, chain distillation, and a learned prior, fit on a held-out set, that selects and weights ensemble members. Together these produce a diverse, high-quality population of models whose predictions are combined. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only about 56 epochs (about 4.6× fewer), reaches a cumulative data efficiency of about 12.9× under the Slowrun setting, and transfers to downstream benchmarks.
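The aggregation step, in which weights fit on held-out data select and weight population members for a given inference budget, can be illustrated with a minimal sketch. The function names, the top-k selection rule, and the choice to average probabilities (rather than logits) are assumptions made for illustration; the summary does not specify how the prior is parameterized or fit.

```python
import math

def select_members(weights, budget):
    """Keep the `budget` highest-weight members and renormalize (assumed rule)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    top = order[:budget]
    total = sum(weights[i] for i in top)
    return {i: weights[i] / total for i in top}

def ensemble_probs(member_probs, selected):
    """Weighted average of the selected members' next-token distributions."""
    vocab = len(member_probs[0])
    return [
        sum(w * member_probs[i][v] for i, w in selected.items())
        for v in range(vocab)
    ]

def nll(probs, target):
    """Negative log-likelihood of the target token under the fused distribution."""
    return -math.log(probs[target])

# Hypothetical weights for three members, assumed already fit on a held-out set.
weights = [0.5, 0.3, 0.2]
member_probs = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
selected = select_members(weights, budget=2)
fused = ensemble_probs(member_probs, selected)
```

With an inference budget of two, only the two highest-weight members are evaluated, and their weights are rescaled to sum to one before averaging.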

📝 Abstract

Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ${\sim}56$ epochs (${\sim}4.6\times$ fewer), or ${\sim}67$ epochs (${\sim}3.8\times$ fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach cumulative ${\sim}12.9\times$ data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.
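As a rough illustration of the first primitive, the sketch below produces a cyclic schedule in which learning rate and weight decay move in opposite directions, and marks the end of each cycle as a point where a population member could be saved. The cosine shape, the direction (learning rate high and weight decay low at the start of a cycle), and all numeric values are assumptions; the abstract states only that the two quantities are anti-correlated.

```python
import math

def cyclic_lr_wd(step, cycle_len, lr_min=1e-4, lr_max=1e-3,
                 wd_min=0.01, wd_max=0.1):
    """Return (lr, wd) for a 0-indexed step; all bounds are hypothetical."""
    phase = (step % cycle_len) / cycle_len           # position in cycle, in [0, 1)
    high = 0.5 * (1.0 + math.cos(math.pi * phase))   # 1 at cycle start, falls toward 0
    lr = lr_min + (lr_max - lr_min) * high           # lr decays within a cycle
    wd = wd_max - (wd_max - wd_min) * high           # wd rises as lr decays
    return lr, wd

def snapshot_steps(total_steps, cycle_len):
    """Last step of each cycle, where a model snapshot would be collected."""
    return [s for s in range(total_steps) if (s + 1) % cycle_len == 0]

schedule = [cyclic_lr_wd(s, cycle_len=100) for s in range(300)]
snapshots = snapshot_steps(300, cycle_len=100)
```

In a training loop, the returned values would be written into the optimizer's parameter groups at every step. Each restart to a high learning rate moves the trajectory away from the previous snapshot, which is one plausible way such a schedule could yield diverse members.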

Problem

Research questions and friction points this paper is trying to address.

multi-epoch pretraining

compute budget

model saturation

data efficiency

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

hyper-epoch pretraining

model population

chain distillation
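
To make the chain distillation idea concrete, here is a minimal per-token loss sketch in which each model is trained against both the data and its predecessor in the chain. It assumes a standard distillation objective (cross-entropy on the target plus a KL term toward the predecessor's distribution, mixed by a hypothetical `alpha`); the listing does not state the exact objective the authors use.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def chain_distill_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Cross-entropy on the data plus KL(teacher || student); alpha is assumed."""
    log_s = log_softmax(student_logits)
    ce = -log_s[target]
    if teacher_logits is None:
        # The first model in the chain has no predecessor to distill from.
        return ce
    log_t = log_softmax(teacher_logits)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return (1.0 - alpha) * ce + alpha * kl

# Model k uses model k-1 as its teacher; logits here are made-up values.
first = chain_distill_loss([0.2, 1.1, -0.3], None, target=1)
later = chain_distill_loss([0.2, 1.1, -0.3], [0.0, 1.5, -0.5], target=1)
```

Because the teacher for each member is the previous member rather than a fixed model, any improvement in one member is passed along to the next, which matches the description of quality compounding across the population.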