Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

📅 2024-10-22
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Do generative models inevitably suffer “model collapse” during large-scale pretraining with early-stage synthetic data? This paper systematically compares three synthetic-data training paradigms—replacement, accumulation, and constrained subset iteration—across Gaussian estimation, kernel density estimation, and language model fine-tuning. Methodologically, it introduces a generational iterative training framework, a multi-task benchmark suite, and dynamic test-loss modeling. Results demonstrate that “accumulation + full-dataset training” completely avoids collapse (test loss remains stable), whereas “constrained subset iteration” induces progressive performance degradation, and pure replacement inevitably collapses. These findings refute the monolithic assumption that synthetic data inherently causes collapse, instead establishing “data-evolution path dependence” as a new paradigm and empirically delineating safe operational boundaries for synthetic-data utilization.

📝 Abstract
What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e. collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning), to further confirm the possibility of containment: (a) we confirm that the training-workflow of *replacing* all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of *accumulating* synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test loss performance across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
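The contrast between workflows (a) and (b) can be reproduced in miniature on the paper's simplest task-setting, 1-D Gaussian estimation. The sketch below is not the paper's code; the sample size and generation count are illustrative assumptions chosen to make collapse visible quickly.

```python
import random
import statistics

random.seed(0)

def fit_gaussian(data):
    """Fit a 1-D Gaussian by maximum likelihood (sample mean and std)."""
    return statistics.fmean(data), statistics.pstdev(data)

n = 20            # illustrative per-generation sample size (assumption)
generations = 200

real = [random.gauss(0.0, 1.0) for _ in range(n)]  # generation-0 "real" data

# Workflow (a): replace -- each generation fits only the previous
# generation's synthetic samples. Estimation noise compounds and the
# fitted variance drifts toward zero: model collapse.
mu, sigma = fit_gaussian(real)
for _ in range(generations):
    mu, sigma = fit_gaussian([random.gauss(mu, sigma) for _ in range(n)])
sigma_replace = sigma

# Workflow (b): accumulate -- synthetic samples pile up alongside the real
# data and every generation fits the combined pool. Even as the real-data
# fraction shrinks toward zero, the estimate stays stable.
pool = list(real)
mu, sigma = fit_gaussian(pool)
for _ in range(generations):
    pool.extend(random.gauss(mu, sigma) for _ in range(n))
    mu, sigma = fit_gaussian(pool)
sigma_accumulate = sigma

print(f"replace:    sigma after {generations} generations ~ {sigma_replace:.4f}")
print(f"accumulate: sigma after {generations} generations ~ {sigma_accumulate:.4f}")
```

Running this shows the replacement workflow's fitted standard deviation shrinking by orders of magnitude while the accumulation workflow's stays near the true value of 1, mirroring the paper's collapse-versus-stability dichotomy.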
Problem

Research questions and friction points this paper is trying to address.

Impact of synthetic data on model collapse
Strategies to manage synthetic data in training
Long-term stability of generative models with synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replacing real with synthetic data
Accumulating synthetic alongside real data
Constraining data subsets each generation
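The third workflow can be sketched the same way: data accumulate across generations, but each fit is restricted to a fixed-size random subset of the pool, a hypothetical stand-in for a fixed training budget. The sizes below are illustrative assumptions, not the paper's settings.

```python
import random
import statistics

random.seed(0)

n = 20            # fixed per-generation training budget (assumption)
generations = 200

pool = [random.gauss(0.0, 1.0) for _ in range(n)]  # real data, generation 0
mu, sigma = statistics.fmean(pool), statistics.pstdev(pool)

# Workflow (c): real and synthetic data accumulate, but each generation
# fits only a fixed-size random subset of the pool. Old data in the pool
# slow the drift, so degradation is gradual rather than explosive.
for _ in range(generations):
    pool.extend(random.gauss(mu, sigma) for _ in range(n))
    subset = random.sample(pool, n)   # constrained training set
    mu, sigma = statistics.fmean(subset), statistics.pstdev(subset)

print(f"subset workflow: sigma after {generations} generations ~ {sigma:.4f}")
print(f"pool size: {len(pool)}")
```

Unlike pure replacement, the subset draws keep touching earlier generations' data, so the fitted parameters erode slowly, which is the qualitative behavior the paper reports for this workflow.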