🤖 AI Summary
Existing synthetic story datasets such as TinyStories struggle to achieve controllability, diversity, and simplicity at the same time. SimpleStories addresses this: a large-scale synthetic story dataset written in simple language, comprising 2 million stories each in English and Japanese. Methodologically, the generation prompts are parametrized with features at multiple levels of abstraction, giving systematic control over story characteristics and ensuring broad syntactic and semantic diversity. Building on and addressing limitations of the TinyStories dataset, the work demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
📝 Abstract
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
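To make the core idea concrete, here is a minimal sketch of prompt parametrization with features at multiple levels of abstraction. The feature names and values below (`topic`, `style`, `grammar`, `moral`) are illustrative assumptions, not the actual SimpleStories feature set or prompt template: each generation prompt is assembled by sampling one value per feature level, so varying the samples yields systematic syntactic and semantic diversity.

```python
import random

# Hypothetical feature levels, from high-level semantics (topic, moral)
# down to surface syntax (grammar). Illustrative only -- not the actual
# SimpleStories feature inventory.
FEATURES = {
    "topic": ["friendship", "a lost toy", "the seasons"],
    "style": ["playful", "calm", "adventurous"],
    "grammar": ["simple past tense", "present-tense dialogue"],
    "moral": ["sharing is kind", "trying again helps"],
}

# A hypothetical prompt template with one slot per feature level.
PROMPT_TEMPLATE = (
    "Write a short story in simple language about {topic}. "
    "The tone should be {style}, using {grammar}. "
    "The story should gently convey that {moral}."
)

def sample_prompt(rng: random.Random) -> str:
    """Sample one value per feature level and fill the template."""
    choice = {name: rng.choice(values) for name, values in FEATURES.items()}
    return PROMPT_TEMPLATE.format(**choice)

if __name__ == "__main__":
    rng = random.Random(0)  # seeding makes the sampled prompts reproducible
    for _ in range(3):
        print(sample_prompt(rng))
```

With 3 × 3 × 2 × 2 = 36 combinations even in this toy setup, the combinatorics of independent feature levels is what lets a small feature inventory produce a large, diverse prompt space.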