Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

📅 2024-10-22
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Do generative models inevitably suffer “model collapse” during large-scale pretraining with early-stage synthetic data? This paper systematically compares three synthetic-data training paradigms—replacement, accumulation, and constrained subset iteration—across Gaussian estimation, kernel density estimation, and language model fine-tuning. Methodologically, it introduces a generational iterative training framework, a multi-task benchmark suite, and dynamic test-loss modeling. Results demonstrate that “accumulation + full-dataset training” completely avoids collapse (test loss remains stable), whereas “constrained subset iteration” induces progressive performance degradation, and pure replacement inevitably collapses. These findings refute the monolithic assumption that synthetic data inherently causes collapse, instead establishing “data-evolution path dependence” as a new paradigm and empirically delineating safe operational boundaries for synthetic-data utilization.

📝 Abstract
What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e. collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning), to further confirm the possibility of containment: (a) we confirm that the training-workflow of *replacing* all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of *accumulating* synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test loss performance across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
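The contrast between workflows (a) and (b) can be reproduced in miniature on the paper's simplest task-setting, 1-D Gaussian estimation. The sketch below is not the paper's code; the sample size and generation count are illustrative assumptions chosen to make collapse visible quickly.

```python
import random
import statistics

random.seed(0)

def fit_gaussian(data):
    """Fit a 1-D Gaussian by maximum likelihood (sample mean and std)."""
    return statistics.fmean(data), statistics.pstdev(data)

n = 20            # illustrative per-generation sample size (assumption)
generations = 200

real = [random.gauss(0.0, 1.0) for _ in range(n)]  # generation-0 "real" data

# Workflow (a): replace -- each generation fits only the previous
# generation's synthetic samples. Estimation noise compounds and the
# fitted variance drifts toward zero: model collapse.
mu, sigma = fit_gaussian(real)
for _ in range(generations):
    mu, sigma = fit_gaussian([random.gauss(mu, sigma) for _ in range(n)])
sigma_replace = sigma

# Workflow (b): accumulate -- synthetic samples pile up alongside the real
# data and every generation fits the combined pool. Even as the real-data
# fraction shrinks toward zero, the estimate stays stable.
pool = list(real)
mu, sigma = fit_gaussian(pool)
for _ in range(generations):
    pool.extend(random.gauss(mu, sigma) for _ in range(n))
    mu, sigma = fit_gaussian(pool)
sigma_accumulate = sigma

print(f"replace:    sigma after {generations} generations ~ {sigma_replace:.4f}")
print(f"accumulate: sigma after {generations} generations ~ {sigma_accumulate:.4f}")
```

Running this shows the replacement workflow's fitted standard deviation shrinking by orders of magnitude while the accumulation workflow's stays near the true value of 1, mirroring the paper's collapse-versus-stability dichotomy.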
Problem

Research questions and friction points this paper is trying to address.

Impact of synthetic data on model collapse
Strategies to manage synthetic data in training
Long-term stability of generative models with synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replacing real with synthetic data
Accumulating synthetic alongside real data
Constraining data subsets each generation
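The third workflow can be sketched the same way: data accumulate across generations, but each fit is restricted to a fixed-size random subset of the pool, a hypothetical stand-in for a fixed training budget. The sizes below are illustrative assumptions, not the paper's settings.

```python
import random
import statistics

random.seed(0)

n = 20            # fixed per-generation training budget (assumption)
generations = 200

pool = [random.gauss(0.0, 1.0) for _ in range(n)]  # real data, generation 0
mu, sigma = statistics.fmean(pool), statistics.pstdev(pool)

# Workflow (c): real and synthetic data accumulate, but each generation
# fits only a fixed-size random subset of the pool. Old data in the pool
# slow the drift, so degradation is gradual rather than explosive.
for _ in range(generations):
    pool.extend(random.gauss(mu, sigma) for _ in range(n))
    subset = random.sample(pool, n)   # constrained training set
    mu, sigma = statistics.fmean(subset), statistics.pstdev(subset)

print(f"subset workflow: sigma after {generations} generations ~ {sigma:.4f}")
print(f"pool size: {len(pool)}")
```

Unlike pure replacement, the subset draws keep touching earlier generations' data, so the fitted parameters erode slowly, which is the qualitative behavior the paper reports for this workflow.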