🤖 AI Summary
Traditional language model pretraining applies a next-token prediction loss uniformly to all tokens, which is inefficient and biases the learned distribution. This work proposes Selective Language Modeling (SLM), a paradigm that scores token importance with a reference model and computes the prediction loss only on high-value tokens, thereby uncovering and leveraging token-level training dynamics. By abandoning uniform token-level modeling, SLM improves training efficiency and model performance simultaneously. On the MATH benchmark, SLM achieves 40.6% and 51.8% accuracy with 1B- and 7B-parameter models, respectively, matching DeepSeekMath while using only 3% of its pretraining tokens; few-shot mathematical reasoning improves by up to 30%; and general continual pretraining yields an average +6.8% gain across 15 downstream benchmarks, establishing a new paradigm for efficient large-language-model pretraining.
📝 Abstract
Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "not all tokens in a corpus are equally important for language model training". Our initial analysis examines the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens aligned with the desired distribution. This approach scores pretraining tokens using a reference model and then trains the language model with a focused loss on the tokens with higher scores. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% on 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when continually pretrained on 80B general tokens, Rho-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the efficiency and the performance of language model pre-training.
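The token-selection step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes per-token cross-entropy losses from the training model and the reference model are already computed, and the function name, `keep_ratio` parameter, and excess-loss ranking rule are illustrative assumptions about how SLM-style selection might look.

```python
import numpy as np

def slm_selective_loss(token_losses, ref_losses, keep_ratio=0.5):
    """Sketch of a Selective Language Modeling (SLM)-style loss.

    token_losses: per-token cross-entropy from the model being trained.
    ref_losses:   per-token cross-entropy from a reference model.
    Tokens whose excess loss (model minus reference) ranks in the top
    `keep_ratio` fraction are kept; all other tokens are masked out,
    so the training signal focuses on high-value tokens.
    """
    token_losses = np.asarray(token_losses, dtype=float)
    ref_losses = np.asarray(ref_losses, dtype=float)
    excess = token_losses - ref_losses          # score = how much worse than reference
    k = max(1, int(len(excess) * keep_ratio))   # number of tokens to keep
    keep_idx = np.argsort(excess)[::-1][:k]     # highest excess loss first
    mask = np.zeros(len(excess), dtype=bool)
    mask[keep_idx] = True
    # The loss is averaged only over the selected tokens.
    return token_losses[mask].mean(), mask

# Toy example: 4 tokens, keep the top half by excess loss.
loss, mask = slm_selective_loss([3.0, 1.0, 2.0, 5.0],
                                [1.0, 1.0, 1.0, 1.0],
                                keep_ratio=0.5)
```

In a real training loop the mask would zero out the loss of unselected tokens before backpropagation, so gradients flow only through the selected subset.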