Beyond Real Data: Synthetic Data through the Lens of Regularization

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In data-scarce real-world scenarios, synthetic data can enhance model generalization, yet excessive incorporation degrades performance due to the distributional shift (e.g., increased Wasserstein distance) between the synthetic and real domains. Method: We propose the first analytical framework grounded in algorithmic stability and regularization theory that quantifies how the mixture ratio of synthetic to real data affects generalization error. The analysis reveals a U-shaped relationship between test error and the synthetic data proportion and derives the theoretically optimal mixing ratio. Crucially, the Wasserstein distance is incorporated into the generalization bound for kernel ridge regression, extending the result to domain adaptation settings. Results: Experiments on CIFAR-10 and clinical brain MRI datasets validate the theory: models trained at the predicted optimal ratio achieve significantly lower test error and show improved robustness and generalization, both in-domain and cross-domain.
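The U-shaped behavior described above can be reproduced in a toy setting. The sketch below (an illustration, not the paper's experimental setup) trains closed-form kernel ridge regression on a mix of scarce "real" samples and abundant "synthetic" samples drawn from a deliberately shifted generator, then sweeps the synthetic proportion; all functions and constants here are hypothetical choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian RBF kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit_predict(X_tr, y_tr, X_te, lam=1e-2, gamma=1.0):
    # Closed-form kernel ridge regression: alpha = (K + lam*I)^{-1} y.
    K = rbf_kernel(X_tr, X_tr, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)
    return rbf_kernel(X_te, X_tr, gamma) @ alpha

# "Real" data: y = sin(3x) on x ~ U[0,1]; "synthetic" data come from a
# shifted input distribution (a stand-in for an imperfect generator).
def sample_real(n):
    x = rng.uniform(0, 1, (n, 1))
    return x, np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(n)

def sample_synth(n, shift=0.3):
    x = rng.uniform(0, 1, (n, 1)) + shift   # distribution shift
    return x, np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(n)

n_real, n_total = 20, 200
X_r, y_r = sample_real(n_real)
X_te, y_te = sample_real(500)

ratios = np.linspace(0.0, 0.9, 10)   # proportion of synthetic samples
errors = []
for rho in ratios:
    n_s = int(rho * n_total)
    X_s, y_s = sample_synth(n_s)
    X = np.vstack([X_r, X_s]) if n_s else X_r
    y = np.concatenate([y_r, y_s]) if n_s else y_r
    pred = krr_fit_predict(X, y, X_te)
    errors.append(np.mean((pred - y_te) ** 2))

best = ratios[int(np.argmin(errors))]
print(f"best synthetic proportion ~ {best:.2f}")
```

With a moderate shift, a little synthetic data typically helps (more training points) before the shift-induced bias dominates, which is the trade-off the paper formalizes.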

📝 Abstract
Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance. In this paper, we present a learning-theoretic framework to quantify the trade-off between synthetic and real data. Our approach leverages algorithmic stability to derive generalization error bounds, characterizing the optimal synthetic-to-real data ratio that minimizes expected test error as a function of the Wasserstein distance between the real and synthetic distributions. We motivate our framework in the setting of kernel ridge regression with mixed data, offering a detailed analysis that may be of independent interest. Our theory predicts the existence of an optimal ratio, leading to a U-shaped behavior of test error with respect to the proportion of synthetic data. Empirically, we validate this prediction on CIFAR-10 and a clinical brain MRI dataset. Our theory extends to the important scenario of domain adaptation, showing that carefully blending synthetic target data with limited source data can mitigate domain shift and enhance generalization. We conclude with practical guidance for applying our results to both in-domain and out-of-domain scenarios.
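The abstract's key quantity is the Wasserstein distance between the real and synthetic distributions. In one dimension with equal sample sizes, the empirical Wasserstein-1 distance reduces to the mean absolute difference of sorted samples (the quantile-coupling formula); a minimal sketch, with a mean-shifted Gaussian standing in for a synthetic generator:

```python
import numpy as np

def empirical_w1(a, b):
    # 1-D empirical Wasserstein-1 distance for equal-size samples:
    # average absolute gap between matched order statistics.
    a, b = np.sort(a), np.sort(b)
    return np.abs(a - b).mean()

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, 10_000)
synth = rng.normal(0.5, 1.0, 10_000)   # mean-shifted "generator"
print(empirical_w1(real, synth))       # close to 0.5 for a pure mean shift
```

For N(0,1) vs N(0.5,1) the true W1 is exactly the mean shift, 0.5, so the estimate above should land near it.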
Problem

Research questions and friction points this paper is trying to address.

Quantifying the trade-off between synthetic and real data usage
Determining optimal synthetic-to-real data ratio for generalization
Mitigating domain shift through strategic synthetic data blending
Innovation

Methods, ideas, or system contributions that make the work stand out.

Algorithmic stability framework for data blending
Optimal synthetic-real ratio via Wasserstein distance
Domain adaptation through mixed data integration
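The "optimal ratio via Wasserstein distance" idea can be illustrated with a stylized bound (a hypothetical stand-in for the paper's actual bound, with assumed constants): an estimation term that shrinks as synthetic samples are added, plus a bias term growing with the synthetic fraction times the distribution distance W.

```python
import numpy as np

# Assumed, illustrative constants: n_real labeled real samples, a pool
# of n_pool synthetic samples, Wasserstein distance W between domains.
n_real, n_pool, W = 50, 1000, 0.4
A, B = 1.0, 1.0   # problem-dependent constants (assumed)

rho = np.linspace(0, 1, 1001)          # fraction of the pool used
bound = A / (n_real + rho * n_pool) + B * (rho * W) ** 2
rho_star = rho[int(np.argmin(bound))]  # interior minimum -> U-shape
print(f"optimal synthetic fraction ~ {rho_star:.2f}")
```

Minimizing this surrogate reproduces the qualitative story: the optimal fraction is strictly between 0 and 1 and shrinks as W grows.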