TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data

📅 2025-06-18
🤖 AI Summary
To address the challenges of constructing multi-speaker text-to-speech (TTS) systems from noisy, sparsely annotated web-sourced speech data, this paper proposes a fully automated, closed-loop corpus construction framework. Methodologically, it integrates quality-aware dynamic cleaning, phoneme-level adaptive filtering, a lightweight MOS predictor for real-time sample selection, and joint model-corpus co-optimization to iteratively enhance both corpus quality and model performance online. Its key innovation lies in the first incorporation of MOS prediction into the data cleaning feedback loop, thereby balancing noise robustness with retention of low-quality yet information-rich utterances. Evaluated on Japanese YouTube speech data, the framework achieves significant improvements over conventional acoustic-quality-based baselines: +0.8 MOS in synthesized speech naturalness and a 37% increase in speaker coverage, demonstrating superior speaker diversity and audio fidelity.

📝 Abstract
This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as "dark data," such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
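As a rough illustration of component (2), utterance-level selection of a cleansing method, the sketch below tries each candidate method on one utterance and keeps whichever output scores best. It is not the authors' implementation: the function `select_cleansing`, the toy methods `identity` and `clip`, and the scorer `toy_quality` are all hypothetical stand-ins, and a real system would presumably use actual denoisers and a learned quality estimate.

```python
from typing import Callable, Dict, List, Tuple

Audio = List[float]  # toy stand-in for a waveform (assumption, not the paper's format)


def select_cleansing(
    audio: Audio,
    methods: Dict[str, Callable[[Audio], Audio]],
    quality: Callable[[Audio], float],
) -> Tuple[str, Audio, float]:
    """Apply every candidate method and keep the one with the highest quality score.

    Ties go to the method listed first, so listing "none" first means
    cleansing is only applied when it strictly improves the score.
    """
    best_name, best_audio, best_score = "", audio, float("-inf")
    for name, method in methods.items():
        cleaned = method(audio)
        score = quality(cleaned)
        if score > best_score:
            best_name, best_audio, best_score = name, cleaned, score
    return best_name, best_audio, best_score


# Hypothetical toy methods and scorer, used only to make the sketch runnable.
def identity(a: Audio) -> Audio:
    return list(a)


def clip(a: Audio) -> Audio:
    return [max(-0.5, min(0.5, x)) for x in a]


def toy_quality(a: Audio) -> float:
    # Pretend that a smaller peak amplitude means "cleaner" audio.
    return -max(abs(x) for x in a)


METHODS = {"none": identity, "clip": clip}
choice_loud = select_cleansing([0.1, 0.9, -0.8], METHODS, toy_quality)
choice_quiet = select_cleansing([0.1, 0.2, -0.3], METHODS, toy_quality)
```

Here the loud utterance is routed to `clip`, while the quiet one is left untouched because clipping does not improve its score. The point is only that the choice is made per utterance rather than once for the whole corpus.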
Problem

Research questions and friction points this paper is trying to address.

Automates multi-speaker TTS training from noisy web data
Addresses limitations of curated data in TTS scalability
Optimizes data selection and cleansing for model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data collection from dark sources
Dynamic selection of data cleansing methods
Closed-loop corpus and model joint optimization
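The closed-loop idea in the last item can be sketched as an alternation between training on the current corpus and re-selecting utterances by predicted MOS. This is a minimal sketch under stated assumptions, not the procedure from the paper: `closed_loop_select`, `toy_train`, `toy_predict_mos`, the threshold rule, and the stopping condition are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def closed_loop_select(
    corpus: Sequence[str],
    train: Callable[[Sequence[str]], float],
    predict_mos: Callable[[float, str], float],
    threshold: float,
    max_rounds: int = 5,
) -> List[str]:
    """Alternate between training on the current selection and filtering it.

    Each round trains a model on the selected utterances, scores every
    selected utterance with a model-dependent predicted MOS, and keeps
    those at or above the threshold.
    """
    selected = list(corpus)
    for _ in range(max_rounds):
        model = train(selected)
        kept = [u for u in selected if predict_mos(model, u) >= threshold]
        if not kept or len(kept) == len(selected):
            break  # stop when nothing changes or everything would be dropped
        selected = kept
    return selected


# Hypothetical toy data: a fixed per-utterance "quality" value.
TOY_SCORES: Dict[str, float] = {"a": 4.0, "b": 3.5, "c": 1.0}


def toy_train(utts: Sequence[str]) -> float:
    # The "model" is just the mean quality of its training data.
    return sum(TOY_SCORES[u] for u in utts) / len(utts)


def toy_predict_mos(model: float, utt: str) -> float:
    # Predicted MOS depends on both the utterance and the current model.
    return 0.5 * TOY_SCORES[utt] + 0.5 * model


final_corpus = closed_loop_select(["a", "b", "c"], toy_train, toy_predict_mos, 3.0)
```

In this toy run the first round drops `c`, and the second round keeps both remaining utterances, so the loop stops. Because the score depends on the current model, the selection can change from round to round, which is the property the closed loop relies on.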