🤖 AI Summary
Problem: The domain gap between synthetic and real data degrades model generalization. Method: This paper systematically investigates mixed training with real and synthetic data, comparing two widely used mixing strategies. We conduct ablation studies across three prevalent neural network architectures and three hybrid datasets, quantifying, for the first time, the impact of varying real-to-synthetic data ratios on model robustness and generalization. Contribution/Results: We demonstrate that an optimal ratio significantly narrows the domain gap, enhancing real-world performance without compromising the labeling efficiency of real data. Crucially, this optimal ratio exhibits consistent patterns across architectures and datasets. Our findings provide a reproducible, methodology-driven framework and empirical foundation for efficient and reliable synthetic-data-augmented model training.
📝 Abstract
Synthetic data has emerged as a cost-effective alternative to real data for training artificial neural networks (ANNs). However, the disparity between synthetic and real data results in a domain gap, which leads to poor performance and generalization when the trained ANN is applied to real-world scenarios. Several strategies have been developed to bridge this gap by combining synthetic and real data, an approach known as mixed training on hybrid datasets. While these strategies have been shown to mitigate the domain gap, their generalizability and robustness across tasks and architectures have not been systematically evaluated. To address this, our study comprehensively analyzes two widely used mixing strategies on three prevalent architectures and three distinct hybrid datasets. From these datasets, we sample subsets with varying proportions of synthetic to real data to investigate the impact of the synthetic and real components. The findings provide insights into optimizing the use of synthetic data when training ANNs, contributing to improved robustness and efficacy.
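The subset sampling described in the abstract can be illustrated with a minimal sketch. It assumes a fixed subset size and a target synthetic fraction. The function name `sample_hybrid_subset`, its parameters, and the toy pools are hypothetical and are not taken from the paper, which does not specify its sampling procedure or name its two mixing strategies here.

```python
import random


def sample_hybrid_subset(real, synthetic, total, synthetic_fraction, seed=0):
    """Draw `total` items, round(total * synthetic_fraction) of them synthetic.

    Hypothetical helper for illustration only; not the paper's procedure.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must lie in [0, 1]")
    n_syn = round(total * synthetic_fraction)
    n_real = total - n_syn
    if n_syn > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples in one of the pools")
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    # Sample without replacement from each pool, then interleave by shuffling.
    subset = rng.sample(real, n_real) + rng.sample(synthetic, n_syn)
    rng.shuffle(subset)
    return subset


# Toy pools standing in for real and synthetic samples (assumed data).
real_pool = [("real", i) for i in range(100)]
synthetic_pool = [("synthetic", i) for i in range(100)]

# Sweep the synthetic fraction while holding the subset size constant.
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    subset = sample_hybrid_subset(
        real_pool, synthetic_pool, total=40, synthetic_fraction=fraction
    )
    n_syn = sum(1 for source, _ in subset if source == "synthetic")
    print(f"synthetic fraction {fraction:.2f}: {n_syn} synthetic, {len(subset) - n_syn} real")
```

Holding the subset size constant while sweeping the fraction is one possible design choice: it separates the effect of the mixing ratio from the effect of simply having more training data. Whether the paper fixes the total size or the real-data budget is not stated in the abstract.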