Development of Hybrid Artificial Intelligence Training on Real and Synthetic Data: Benchmark on Two Mixed Training Strategies

📅 2025-06-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: The domain gap between synthetic and real data degrades model generalization. Method: This paper systematically investigates mixed training with real and synthetic data, comparing two hybrid strategies. The authors conduct ablation studies across three mainstream neural network architectures and three hybrid datasets, quantifying the impact of varying real-to-synthetic data ratios on model robustness and generalization. Contribution/Results: The study reports that a well-chosen ratio significantly narrows the domain gap and improves real-world performance without compromising the labeling efficiency of real data, and that this ratio follows consistent patterns across architectures and datasets. The findings provide a reproducible framework and an empirical foundation for efficient and reliable synthetic-data-augmented model training.

📝 Abstract

Synthetic data has emerged as a cost-effective alternative to real data for training artificial neural networks (ANNs). However, the disparity between synthetic and real data results in a domain gap, which leads to poor performance and generalization when the trained ANN is applied to real-world scenarios. Several strategies that combine synthetic and real data, known as mixed training on hybrid datasets, have been developed to bridge this gap. While these strategies have been shown to mitigate the domain gap, their generalizability and robustness across tasks and architectures have not been systematically evaluated. To address this, our study comprehensively analyzes two widely used mixing strategies on three prevalent architectures and three distinct hybrid datasets. From these datasets, we sample subsets with varying proportions of synthetic to real data to investigate the impact of the synthetic and real components. The findings of this paper provide insights into optimizing the use of synthetic data when training an ANN, contributing to improved robustness and efficacy.
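
The subset sampling mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_hybrid_subset`, the fixed `total_size`, the `synthetic_fraction` parameter, and the seeding scheme are all assumptions made for illustration, and the abstract does not specify how the two mixing strategies consume such subsets.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_hybrid_subset(
    real: Sequence[T],
    synthetic: Sequence[T],
    synthetic_fraction: float,
    total_size: int,
    seed: int = 0,
) -> List[Tuple[T, str]]:
    """Draw a shuffled hybrid subset with a given synthetic share.

    Hypothetical helper: returns (sample, source) pairs, where source is
    "real" or "synthetic", so downstream code can tell the two apart.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must lie in [0, 1]")

    # Split the fixed budget between the two sources.
    n_synthetic = round(total_size * synthetic_fraction)
    n_real = total_size - n_synthetic
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough samples for the requested proportions")

    # A local generator keeps the draw reproducible for a given seed.
    rng = random.Random(seed)
    subset = [(x, "real") for x in rng.sample(list(real), n_real)]
    subset += [(x, "synthetic") for x in rng.sample(list(synthetic), n_synthetic)]
    rng.shuffle(subset)
    return subset


if __name__ == "__main__":
    # Placeholder data: integers stand in for real and synthetic samples.
    real_pool = list(range(1000))
    synthetic_pool = list(range(1000, 3000))
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        subset = sample_hybrid_subset(real_pool, synthetic_pool, fraction, 200)
        n_syn = sum(1 for _, source in subset if source == "synthetic")
        print(f"synthetic_fraction={fraction:.2f}: {n_syn} synthetic, {len(subset) - n_syn} real")
```

Holding `total_size` constant while sweeping `synthetic_fraction` is one possible design choice: it separates the effect of the mixing proportion from the effect of simply having more training data. Whether the paper fixes the total size or the amount of real data is not stated in the abstract.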

Problem

Research questions and friction points this paper is trying to address.

Bridging domain gap between synthetic and real data for ANN training

Evaluating mixed training strategies' generalizability across tasks and architectures

Optimizing synthetic-real data proportions to enhance ANN robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid AI training with real and synthetic data

Evaluates two mixed training strategies systematically

Optimizes synthetic data use for ANN robustness
