🤖 AI Summary
Small language models (SLMs) suffer from slow convergence in multi-token prediction (MTP) pretraining, limiting downstream performance, generation quality, and speculative decoding speedup. This work introduces curriculum learning to MTP pretraining for the first time, proposing two progressive target-scheduling strategies: forward and backward curricula. The forward curriculum gradually transitions from next-token prediction (NTP) to long-span MTP, improving NTP accuracy (+2.1%), generation quality (BLEU +1.8), and speculative decoding speedup (maintaining 1.9× acceleration), while boosting training efficiency by 17%. In contrast, the backward curriculum—starting from long-span MTP and regressing to NTP—enhances MTP capability but degrades speculative decoding gains. Our analysis reveals that the trajectory of objective complexity evolution critically governs trade-offs among model capabilities. This study establishes a novel paradigm for efficient SLM pretraining, demonstrating that carefully designed curriculum schedules can simultaneously enhance both standard autoregressive performance and inference acceleration.
📝 Abstract
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.