🤖 AI Summary
Traditional language model pretraining applies a next-token prediction loss uniformly to all tokens, which is inefficient and biases the learned distribution. This work proposes Selective Language Modeling (SLM), a paradigm that scores token importance with a reference model and computes the prediction loss only on high-value tokens, thereby uncovering and leveraging token-level training dynamics. By abandoning uniform token-level modeling, SLM improves training efficiency and model performance simultaneously. On the MATH benchmark, SLM achieves 40.6% and 51.8% accuracy with 1B- and 7B-parameter models, respectively, matching DeepSeekMath while using only 3% of its pretraining tokens; few-shot mathematical reasoning improves by up to 30%; and general continual pretraining yields an average +6.8% gain across 15 downstream benchmarks, establishing a new paradigm for efficient large-language-model pretraining.
📝 Abstract
Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "not all tokens in a corpus are equally important for language model training". Our initial analysis examines the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens aligned with the desired distribution. This approach scores pretraining tokens using a reference model and then trains the language model with a focused loss on the tokens with higher scores. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% on 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when continually pretrained on 80B general tokens, Rho-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the efficiency and the performance of language model pre-training.
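The token-selection step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes per-token cross-entropy losses from the training model and the reference model are already computed, and the function name, `keep_ratio` parameter, and excess-loss ranking rule are illustrative assumptions about how SLM-style selection might look.

```python
import numpy as np

def slm_selective_loss(token_losses, ref_losses, keep_ratio=0.5):
    """Sketch of a Selective Language Modeling (SLM)-style loss.

    token_losses: per-token cross-entropy from the model being trained.
    ref_losses:   per-token cross-entropy from a reference model.
    Tokens whose excess loss (model minus reference) ranks in the top
    `keep_ratio` fraction are kept; all other tokens are masked out,
    so the training signal focuses on high-value tokens.
    """
    token_losses = np.asarray(token_losses, dtype=float)
    ref_losses = np.asarray(ref_losses, dtype=float)
    excess = token_losses - ref_losses          # score = how much worse than reference
    k = max(1, int(len(excess) * keep_ratio))   # number of tokens to keep
    keep_idx = np.argsort(excess)[::-1][:k]     # highest excess loss first
    mask = np.zeros(len(excess), dtype=bool)
    mask[keep_idx] = True
    # The loss is averaged only over the selected tokens.
    return token_losses[mask].mean(), mask

# Toy example: 4 tokens, keep the top half by excess loss.
loss, mask = slm_selective_loss([3.0, 1.0, 2.0, 5.0],
                                [1.0, 1.0, 1.0, 1.0],
                                keep_ratio=0.5)
```

In a real training loop the mask would zero out the loss of unselected tokens before backpropagation, so gradients flow only through the selected subset.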