🤖 AI Summary
In curriculum-based LLM pretraining, standard learning rate decay schedules are incompatible with training data ordered by increasing quality: the learning rate becomes excessively low during the late, high-quality stages, so optimization on exactly the most valuable data is weakest, undermining the benefits of the curriculum. This work identifies and analyzes this misalignment mechanism and proposes two lightweight improvements that require no extra computation or data: (1) attenuated learning rate decay, and (2) end-of-stage model averaging. Evaluated on a 1.5B-parameter model trained over 30B tokens, the two methods together yield an average performance gain of +1.64% across multiple benchmarks, significantly outperforming the random-shuffling baseline. To our knowledge, this is the first systematic study to expose the critical coupling between optimization scheduling and curriculum design; the results empirically validate that co-designing these components is essential for realizing the full potential of curriculum learning in LLM pretraining.
📝 Abstract
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
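The two strategies in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the paper's exact recipe: the function names, the cosine shape, the 0.5 final-to-peak LR ratio, and the uniform averaging weights are all assumptions for demonstration.

```python
import numpy as np

def attenuated_cosine_lr(step, total_steps, peak_lr, final_lr_ratio=0.5):
    """Strategy (1): a moderate LR decay schedule.

    Cosine decay from peak_lr down to final_lr_ratio * peak_lr, so the
    final LR stays a sizable fraction of the peak instead of decaying
    toward zero during the late, high-quality curriculum stages.
    (final_lr_ratio=0.5 is an illustrative choice, not from the paper.)
    """
    final_lr = final_lr_ratio * peak_lr
    cos_factor = 0.5 * (1.0 + np.cos(np.pi * step / total_steps))
    return final_lr + (peak_lr - final_lr) * cos_factor

def average_checkpoints(checkpoints, weights=None):
    """Strategy (2): model averaging in place of LR decay.

    Computes a weighted average of the final few checkpoints, each
    represented here as a dict mapping parameter name -> np.ndarray.
    Defaults to uniform weights (an assumption for this sketch).
    """
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    assert abs(sum(weights) - 1.0) < 1e-8, "weights should sum to 1"
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = sum(w * ckpt[name]
                             for w, ckpt in zip(weights, checkpoints))
    return averaged
```

With `final_lr_ratio=0.5`, the schedule starts at the peak LR and ends at half the peak, rather than near zero as in a standard cosine schedule; the averaging helper would be applied to the last few saved checkpoints after training.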