Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

📅 2025-04-10

🏛️ Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

📈 Citations: 105

✨ Influential: 18

🤖 AI Summary

Large language models (LLMs) are far less data-efficient than children, typically requiring three to four orders of magnitude more training data than the fewer than 100 million words from which a child acquires language. Method: The BabyLM Challenge is a shared task in which participants optimize language model pretraining on a fixed, developmentally plausible data budget, with up to three tracks of progressively looser data restrictions. Submissions are compared on evaluations targeting grammatical ability, downstream task performance, and generalization. Contribution/Results: The winning submissions, built on the LTG-BERT architecture (Samuel et al., 2023), outperformed models trained on trillions of words. Other submissions achieved strong results by training on shorter input sequences or by training a student model on a pretrained teacher. Curriculum learning, which accounted for a large number of submissions, was largely unsuccessful, though some attempts showed modest improvements. From over 30 submissions, the organizers extract concrete recommendations on how best to train data-efficient language models and on where future efforts should (and perhaps should not) focus.

📝 Abstract

Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive models. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget. Submissions are compared on various evaluation tasks targeting grammatical ability, downstream task performance, and generalization. Participants can submit to up to three tracks with progressively looser data restrictions. From over 30 submissions, we extract concrete recommendations on how best to train data-efficient language models, and on where future efforts should (and perhaps should not) focus. The winning submissions using the LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on trillions of words. Other submissions achieved strong results through training on shorter input sequences or training a student model on a pretrained teacher. Curriculum learning attempts, which accounted for a large number of submissions, were largely unsuccessful, though some showed modest improvements.
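To make the ideas of a fixed data budget and shorter input sequences concrete, here is a minimal sketch in Python. It is not taken from the paper or from any submission: the helper names `take_word_budget` and `chunk`, the toy corpus, and the use of whitespace-separated words as the counting unit are all assumptions made for illustration. Real pipelines would typically count and chunk tokenizer output instead.

```python
from typing import Iterable, Iterator, List


def take_word_budget(lines: Iterable[str], budget: int) -> Iterator[str]:
    """Yield whitespace-separated words until `budget` words have been emitted.

    Hypothetical helper: a stand-in for capping a corpus at a fixed word count.
    """
    used = 0
    for line in lines:
        for word in line.split():
            if used >= budget:
                return
            yield word
            used += 1


def chunk(words: Iterable[str], seq_len: int) -> List[List[str]]:
    """Group words into consecutive training sequences of at most `seq_len` words."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for word in words:
        current.append(word)
        if len(current) == seq_len:
            chunks.append(current)
            current = []
    if current:  # keep the final, possibly shorter, sequence
        chunks.append(current)
    return chunks


# Toy corpus (illustrative only); a real budget would be on the order of 10**8 words.
corpus = ["the dog chased the ball", "then the dog slept", "the end"]
budget_words = list(take_word_budget(corpus, budget=10))
short_sequences = chunk(budget_words, seq_len=4)
print(len(budget_words), short_sequences)
```

The sequence length is the knob of interest here: lowering `seq_len` produces more, shorter training examples from the same capped corpus.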

Problem

Research questions and friction points this paper is trying to address.

Optimizing language model training with limited data input

Improving data efficiency to match human language acquisition

Evaluating models on grammar, task performance, and generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

LTG-BERT architecture for data-efficient training

Training on shorter input sequences

Student model learning from pretrained teacher
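The student-teacher idea in the last item can be sketched with a generic knowledge distillation loss. This is the common temperature-scaled formulation (soft targets from the teacher mixed with the usual cross-entropy on the true label), written in plain Python for a single prediction. It is an assumption for illustration, not necessarily the loss any BabyLM submission used; the function names and the `temperature` and `alpha` defaults are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    target: int,
    temperature: float = 2.0,  # assumed value, chosen for illustration
    alpha: float = 0.5,        # assumed weight on the soft-target term
) -> float:
    """Mix a soft-target KL term with hard-target cross-entropy."""
    # Soft-target term: KL(teacher || student) at the given temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0
    )
    # Hard-target term: cross-entropy of the student on the true label.
    ce = -math.log(softmax(student_logits)[target])
    # The T**2 factor keeps the soft-term gradient scale comparable across temperatures.
    return alpha * temperature**2 * kl + (1 - alpha) * ce


teacher = [1.0, 2.0, 3.0]
matching = distillation_loss([1.0, 2.0, 3.0], teacher, target=2)
mismatched = distillation_loss([3.0, 2.0, 1.0], teacher, target=2)
print(matching < mismatched)
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains, so a student that disagrees with the teacher is penalized on both terms.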
