🤖 AI Summary
This study investigates the utility of synthetic story data for pretraining language models in low-resource settings under the BabyLM paradigm. Addressing data scarcity, it systematically evaluates the impact of TinyStories and GPT-Neo–generated story completions on model performance, offering an initial empirical examination within a developmentally inspired pretraining framework. Methodologically, GPT-Neo models are fine-tuned on subsets of TinyStories (under 100M words) to generate high-quality story completions; LTG-BERT encoders are then trained on a combined dataset of TinyStories, the generated completions, and a subset of the BabyLM corpus. Results show that synthetic story data generally impairs linguistic understanding—reducing average downstream task performance by 1.2%—while yielding only occasional modest gains. The core contribution is empirical evidence of a negative effect of synthetic stories on encoding-oriented tasks, motivating a "cautious augmentation" approach. This finding provides empirical evidence and practical guidance for data curation strategies in low-resource pretraining.
📝 Abstract
We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using TinyStories: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of TinyStories, while varying the amount of available data. We find that, even with access to fewer than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train LTG-BERT encoder models on a combined dataset consisting of a subset of TinyStories, story completions generated by GPT-Neo, and a subset of the BabyLM dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall has a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low-resource settings and underscores its potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub.
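The data-augmentation pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`build_combined_corpus`, `generate_completion`) and the sampling scheme are assumptions, and a real setup would use a fine-tuned GPT-Neo model rather than the stub generator shown here.

```python
import random


def build_combined_corpus(tinystories, babylm, generate_completion,
                          n_completions, seed=0):
    """Assemble a combined pretraining corpus (illustrative sketch).

    tinystories: list of stories from a TinyStories subset
    babylm: list of documents from a BabyLM subset
    generate_completion: callable mapping a story prompt to a
        generated continuation (a fine-tuned GPT-Neo in the paper;
        stubbed out here)
    n_completions: how many synthetic completions to generate
    """
    rng = random.Random(seed)
    # Sample prompts from the story subset and generate completions.
    prompts = rng.sample(tinystories, min(n_completions, len(tinystories)))
    completions = [generate_completion(p) for p in prompts]
    # The encoder (LTG-BERT in the paper) is then trained on the union.
    return tinystories + completions + babylm


# Usage with a stub generator standing in for GPT-Neo:
corpus = build_combined_corpus(
    ["Once upon a time, a fox found a key."],
    ["The dog ran across the field."],
    lambda p: p + " It opened a tiny door.",
    n_completions=1,
)
```

Varying `n_completions` (and the sizes of the two subsets) is what lets an experiment isolate how much synthetic story data helps or hurts the encoder.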