Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement

📅 2025-01-23
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Real personalized speech data for personalized speech enhancement (PSE) is scarce because of privacy concerns and the difficulty of recording audio in the target acoustic scene. To address this, the paper presents a challenge on zero-shot text-to-speech (TTS)-driven data augmentation, held as part of the Generative Data Augmentation workshop at ICASSP 2025. Participants first build zero-shot TTS systems that synthesize personalized speech, then train PSE models on the augmented personalized dataset. The central question is how the quality of the synthetic speech affects downstream PSE performance. The organizers provide baseline experiments with open-source zero-shot TTS models and release the baseline code and checkpoints online.

📝 Abstract
This paper presents a new challenge, held as part of the Generative Data Augmentation workshop at ICASSP 2025, that calls for zero-shot text-to-speech (TTS) systems to augment speech data for a downstream task, personalized speech enhancement (PSE). Collecting high-quality personalized data is challenging due to privacy concerns and the technical difficulty of recording audio in the test scene. To address these issues, synthetic data generation using generative models has gained significant attention. In this challenge, participants are first tasked with building zero-shot TTS systems to augment personalized data. They then train PSE systems on this augmented personalized dataset. Through this challenge, we aim to investigate how the quality of augmented data generated by zero-shot TTS models affects PSE model performance. We also provide baseline experiments using open-source zero-shot TTS models to encourage participation and benchmark advancements. Our baseline code implementation and checkpoints are available online.
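The two-stage setup described in the abstract (synthesize personalized speech, then train a PSE model on it) could be sketched roughly as follows. This is an illustrative sketch only, not the challenge's baseline code: `build_pse_pairs`, `mix_at_snr`, the `synthesize(text, enrollment)` interface, and the fixed 5 dB signal-to-noise ratio are all assumptions made for this example, and `fake_tts` is a placeholder that returns a sine tone where a real zero-shot TTS model would go.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the clean-to-noise power ratio is `snr_db`."""
    noise = np.resize(noise, clean.shape)  # repeat or trim noise to the clean length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


def build_pse_pairs(synthesize, enrollment, texts, noises, snr_db=5.0):
    """Stage 1: synthesize target-speaker speech. Output: (noisy, clean) pairs for stage 2."""
    pairs = []
    for text, noise in zip(texts, noises):
        # Hypothetical zero-shot TTS call: text plus a short enrollment clip.
        clean = synthesize(text, enrollment)
        pairs.append((mix_at_snr(clean, noise, snr_db), clean))
    return pairs


def fake_tts(text, enrollment, sr=16000):
    """Placeholder for a zero-shot TTS model: returns a 1-second tone, not speech."""
    t = np.arange(sr) / sr
    return 0.1 * np.sin(2 * np.pi * 220.0 * t)


rng = np.random.default_rng(0)
enrollment = rng.standard_normal(16000)                 # stand-in enrollment audio
noises = [rng.standard_normal(8000) for _ in range(2)]  # stand-in noise clips
pairs = build_pse_pairs(fake_tts, enrollment, ["hello", "world"], noises, snr_db=5.0)
```

Each resulting pair would serve as one (input, target) training example for a PSE model. In this framing, swapping `fake_tts` for different zero-shot TTS systems while holding the rest fixed is one way to probe how synthetic speech quality affects downstream PSE performance.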
Problem

Research questions and friction points this paper is trying to address.

Zero-shot Text-to-Speech
Personalized Voice Generation
Privacy-preserving Data Collection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot TTS
Personalized voice enhancement
Synthetic data generation