TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data

📅 2025-06-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the challenges of constructing multi-speaker text-to-speech (TTS) systems from noisy, sparsely annotated web-sourced speech data, this paper proposes a fully automated, closed-loop corpus construction framework. Methodologically, it integrates quality-aware dynamic cleaning, phoneme-level adaptive filtering, a lightweight MOS predictor for real-time sample selection, and joint model-corpus co-optimization, so that corpus quality and model performance improve together over successive iterations. Its key innovation is described as the first incorporation of MOS prediction into the data-cleaning feedback loop, which balances noise robustness against the retention of low-quality yet information-rich utterances. Evaluated on Japanese YouTube speech data, the framework improves on conventional acoustic-quality-based baselines by 0.8 MOS in synthesized speech naturalness and by 37% in speaker coverage, indicating gains in both audio fidelity and speaker diversity.

📝 Abstract

This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as "dark data," such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
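The closed-loop idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Utterance`, `choose_cleanser`, `closed_loop`, `train_fn`, `impact_fn`, and `score_fn`, as well as the fixed threshold and the stopping rule, are all hypothetical and introduced only for illustration. The sketch assumes a caller supplies a training function and a function that returns a predicted-MOS-based impact score for each utterance under the current model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class Utterance:
    utt_id: str
    speaker: str
    quality: float  # hypothetical acoustic-quality estimate, not from the paper


def choose_cleanser(
    utt: Utterance,
    cleansers: Sequence[str],
    score_fn: Callable[[Utterance, str], float],
) -> str:
    """Pick, for one utterance, the cleansing method with the highest score.

    `score_fn` is an assumed callback that rates how well a given
    cleansing method suits this utterance.
    """
    return max(cleansers, key=lambda name: score_fn(utt, name))


def closed_loop(
    corpus: List[Utterance],
    train_fn: Callable[[List[Utterance]], Any],
    impact_fn: Callable[[Any, Utterance], float],
    threshold: float,
    max_rounds: int,
) -> Tuple[Any, List[Utterance]]:
    """Alternate between training and re-selecting data.

    `impact_fn(model, utt)` is assumed to return a predicted-MOS-based
    score of the utterance's effect on the current model.
    """
    selected = list(corpus)
    model = train_fn(selected)
    for _ in range(max_rounds):
        # Re-score every utterance against the current model.
        kept = [u for u in corpus if impact_fn(model, u) >= threshold]
        # Stop when nothing survives or the selection no longer changes.
        if not kept or kept == selected:
            break
        selected = kept
        model = train_fn(selected)
    return model, selected
```

Because every utterance is re-scored against the latest model, a sample dropped in one round can re-enter the corpus in a later round, which is one way to read "dynamically adapting" data selection to the target model.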

Problem

Research questions and friction points this paper is trying to address.

Automates multi-speaker TTS training from noisy web data

Addresses the scalability limits that reliance on curated data imposes on TTS

Optimizes data selection and cleansing for model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated data collection from dark data sources

Dynamic selection of data cleansing methods

Closed-loop corpus and model joint optimization
