🤖 AI Summary
Existing synthetic story datasets such as TinyStories struggle to achieve controllability, diversity, and simplicity at the same time. SimpleStories addresses this: a large-scale synthetic story dataset written in simple language, comprising 2 million stories each in English and Japanese. Methodologically, the generation prompts are parametrized with features at multiple levels of abstraction, giving systematic control over story characteristics and ensuring broad syntactic and semantic diversity. Building on and addressing limitations of the TinyStories dataset, the work demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
📝 Abstract
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
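To make the core idea concrete, here is a minimal sketch of prompt parametrization with features at multiple levels of abstraction. The feature names and values below (`topic`, `style`, `grammar`, `moral`) are illustrative assumptions, not the actual SimpleStories feature set or prompt template: each generation prompt is assembled by sampling one value per feature level, so varying the samples yields systematic syntactic and semantic diversity.

```python
import random

# Hypothetical feature levels, from high-level semantics (topic, moral)
# down to surface syntax (grammar). Illustrative only -- not the actual
# SimpleStories feature inventory.
FEATURES = {
    "topic": ["friendship", "a lost toy", "the seasons"],
    "style": ["playful", "calm", "adventurous"],
    "grammar": ["simple past tense", "present-tense dialogue"],
    "moral": ["sharing is kind", "trying again helps"],
}

# A hypothetical prompt template with one slot per feature level.
PROMPT_TEMPLATE = (
    "Write a short story in simple language about {topic}. "
    "The tone should be {style}, using {grammar}. "
    "The story should gently convey that {moral}."
)

def sample_prompt(rng: random.Random) -> str:
    """Sample one value per feature level and fill the template."""
    choice = {name: rng.choice(values) for name, values in FEATURES.items()}
    return PROMPT_TEMPLATE.format(**choice)

if __name__ == "__main__":
    rng = random.Random(0)  # seeding makes the sampled prompts reproducible
    for _ in range(3):
        print(sample_prompt(rng))
```

With 3 × 3 × 2 × 2 = 36 combinations even in this toy setup, the combinatorics of independent feature levels is what lets a small feature inventory produce a large, diverse prompt space.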