Not All Synthetic Data Is Yours to Learn From

📅 2026-05-29

🤖 AI Summary

This study asks whether a language model can improve by training only on unprompted text that it generated itself, and what determines when such synthetic data helps. In a fully unsupervised setting, base models such as Pythia are fine-tuned on their own unconditional samples. The work proposes a “latent capability resurfacing” hypothesis: the utility of synthetic data depends on compatibility between the source and student models rather than on intrinsic properties of the data. Empirically, self-generated data is the most effective source, same-lineage sources outperform stronger but differently trained ones, and cross-family transfer is substantially weaker. Intrinsic proxies, namely benchmark-level semantic similarity and average per-token likelihood under the student, do not predict which corpora help. The study also finds that, in controlled Pythia experiments, benchmark utility is preserved or improved without any explicit unlearning mechanism while held-out exact-match extraction drops by over 95%, indicating that capability and verbatim memorization can decouple.
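To make the extraction figure concrete, here is a minimal sketch of how an exact-match extraction rate and its relative drop could be computed. It assumes a prefix-prompted protocol (feed the first `prefix_len` tokens of a sequence, then check whether the model reproduces the next `suffix_len` tokens exactly). The paper's actual protocol, evaluation set, and decoding settings are not given here, so the function names, the `generate` interface, and both length parameters are hypothetical.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface (an assumption, not from the paper):
# generate(prefix_tokens, n_new_tokens) -> continuation tokens.
GenerateFn = Callable[[List[int], int], Sequence[int]]


def exact_match_extraction_rate(
    generate: GenerateFn,
    examples: Sequence[Sequence[int]],
    prefix_len: int,
    suffix_len: int,
) -> float:
    """Fraction of examples whose true suffix is reproduced token for token."""
    hits = 0
    total = 0
    for tokens in examples:
        if len(tokens) < prefix_len + suffix_len:
            continue  # too short to split into a prefix and a full suffix
        prefix = list(tokens[:prefix_len])
        target = list(tokens[prefix_len:prefix_len + suffix_len])
        continuation = list(generate(prefix, suffix_len))
        hits += int(continuation == target)
        total += 1
    return hits / total if total else 0.0


def relative_drop(before: float, after: float) -> float:
    """Relative reduction in extraction rate, e.g. 0.95 for a 95% drop."""
    return (before - after) / before if before > 0 else 0.0
```

Under this sketch, a "drop by over 95%" corresponds to `relative_drop(rate_before, rate_after) > 0.95`, where the two rates come from the same examples evaluated before and after self-training.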

📝 Abstract

Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.
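The data flow of prompt-free unconditional self-training can be illustrated with a toy sketch. This is not the paper's setup: a count-based bigram model stands in for a neural base model, and "fine-tuning" is replaced by adding counts. The class `BigramLM`, the function `unconditional_self_train`, and the `weight` parameter are all hypothetical names introduced for illustration. The only point is the loop itself: sample starting from the BOS token alone, then train on those samples with the same next-token objective.

```python
import random
from collections import defaultdict

BOS, EOS = "<bos>", "<eos>"


class BigramLM:
    """Toy stand-in for a base LM: next-token counts given the previous token."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(float))

    def train(self, corpus, weight=1.0):
        # Maximum-likelihood update: accumulate (previous token, next token) counts.
        for seq in corpus:
            prev = BOS
            for tok in list(seq) + [EOS]:
                self.counts[prev][tok] += weight
                prev = tok

    def sample(self, rng, max_len=20):
        # Unconditional generation: the only context is the BOS token.
        seq, prev = [], BOS
        for _ in range(max_len):
            options = self.counts.get(prev)
            if not options:
                break  # no known continuation for this token
            tokens = list(options)
            weights = [options[t] for t in tokens]
            tok = rng.choices(tokens, weights=weights, k=1)[0]
            if tok == EOS:
                break
            seq.append(tok)
            prev = tok
        return seq


def unconditional_self_train(model, n_samples, rng, weight=1.0):
    """Sample a synthetic corpus from BOS alone, then train the model on it."""
    synthetic = [model.sample(rng) for _ in range(n_samples)]
    model.train(synthetic, weight=weight)
    return synthetic
```

In this toy, the model can only sample transitions it already assigns probability to, so the synthetic corpus reweights existing structure and introduces no new bigrams. That is a property of the toy, offered as intuition for the abstract's framing rather than as evidence for it.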

Problem

Research questions and friction points this paper is trying to address.

synthetic data

self-training

language models

capability resurfacing

verbatim memorization

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-training

synthetic data compatibility

latent capability resurfacing