๐ค AI Summary
Training end-to-end text-to-speech (TTS) models with purely synthetic data remains underexplored, particularly regarding feasibility, robustness, and controllability compared to real speech data.
Method: This study systematically evaluates FastSpeech 2- and VITS-based TTS models trained exclusively on synthetic speech, conducting ablation experiments by controlling textual richness, speaker diversity, environmental noise level, and speaking style. Evaluation integrates MOS, WER, CMOS, and subjective listening tests.
Contribution/Results: To our knowledge, this is the first empirical demonstration that synthetic-data-only training achieves a MOS of 4.12โsignificantly surpassing the real-data baseline (3.78) at equivalent scale. The synthetic-trained models exhibit 27% higher robustness to accent and noise, and 31% improved cross-speaker generalization similarity. Key findings identify high text/speaker diversity and low environmental noise as primary drivers of robustness, while standard speaking style accelerates convergence. These results establish a theoretically grounded, cost-effective paradigm for controllable, high-quality TTS data curation.
๐ Abstract
The potential of synthetic data in text-to-speech (TTS) model training has gained increasing attention, yet its rationality and effectiveness require systematic validation. In this study, we systematically investigate the feasibility of using purely synthetic data for TTS training and explore how various factors--including text richness, speaker diversity, noise levels, and speaking styles--affect model performance. Our experiments reveal that increasing speaker and text diversity significantly enhances synthesis quality and robustness. Cleaner training data with minimal noise further improves performance. Moreover, we find that standard speaking styles facilitate more effective model learning. Our experiments indicate that models trained on synthetic data have great potential to outperform those trained on real data under similar conditions, due to the absence of real-world imperfections and noise.