Parameterized Synthetic Text Generation with SimpleStories

📅 2025-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing synthetic story datasets such as TinyStories struggle to achieve controllability, diversity, and simplicity at once. SimpleStories addresses this with a large-scale synthetic story dataset of 2 million stories each in English and Japanese. The method parameterizes generation prompts with features at multiple levels of abstraction, giving systematic control over story characteristics and ensuring broad syntactic and semantic diversity. Building on TinyStories while addressing its limitations, the work shows that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.

📝 Abstract
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
Problem

Research questions and friction points this paper is trying to address.

How to generate synthetic stories with controlled characteristics
How to ensure syntactic and semantic diversity in simple-language text
How to overcome the limitations of the TinyStories dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parametrized prompts for story generation
Multi-level abstraction for diversity control
Large-scale simple language dataset creation
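The core idea of parameterized prompting can be illustrated with a minimal sketch. The feature pools and prompt template below are hypothetical stand-ins (the paper's actual feature taxonomy and wording are not reproduced here); the point is that sampling one value per abstraction level yields a combinatorially large, controllable space of generation prompts:

```python
import random

# Hypothetical feature pools at several levels of abstraction
# (theme is more abstract than topic); illustrative values only.
FEATURES = {
    "theme": ["friendship", "courage", "curiosity"],
    "style": ["dialogue-heavy", "descriptive", "fable-like"],
    "grammar": ["the past tense", "the present tense"],
    "topic": ["a lost toy", "a rainy day", "a new pet"],
}

TEMPLATE = (
    "Write a short story in simple {language} for young readers. "
    "Theme: {theme}. Style: {style}. Use {grammar}. Topic: {topic}."
)

def build_prompt(language="English", seed=None):
    """Sample one value per feature level and fill the prompt template."""
    rng = random.Random(seed)
    choices = {name: rng.choice(pool) for name, pool in FEATURES.items()}
    return TEMPLATE.format(language=language, **choices)

print(build_prompt(language="English", seed=0))
```

Even this toy version spans 3 × 3 × 2 × 3 = 54 distinct prompts per language; scaling the pools and levels is what drives the dataset's syntactic and semantic diversity.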
Authors
Lennart Finke (ETH Zürich)
Thomas Dooms (University of Antwerp)
Mat Allen (Dioptra)
Juan Diego Rodriguez (UT Austin)
Noa Nabeshima (UC Santa Barbara)
Dan Braun (Goodfire)