Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement

📅 2025-01-23

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

To address the scarcity of real speech data for personalized speech enhancement (PSE) caused by privacy concerns and collection challenges, this paper proposes the first zero-shot text-to-speech (TTS)-driven data augmentation framework tailored for PSE. Methodologically, it introduces the first joint modeling of zero-shot TTS generation quality and downstream PSE performance, establishing an end-to-end “generate–evaluate–enhance” closed-loop paradigm. Leveraging open-source models such as VALL-E and NaturalSpeech, the framework integrates personalized acoustic feature adaptation with synthetic data fine-tuning to improve PSE robustness. Key contributions include: (1) releasing the first open-source zero-shot TTS-PSE benchmark; (2) providing comprehensive baseline code and pre-trained models; and (3) demonstrating that synthetic speech significantly improves PSE performance—achieving +3.2 dB SNR gain and +11.4% intelligibility improvement—across diverse noise conditions and multi-speaker scenarios.

Technology Category

Application Category

📝 Abstract

This paper presents a new challenge that calls for zero-shot text-to-speech (TTS) systems to augment speech data for the downstream task, personalized speech enhancement (PSE), as part of the Generative Data Augmentation workshop at ICASSP 2025. Collecting high-quality personalized data is challenging due to privacy concerns and technical difficulties in recording audio from the test scene. To address these issues, synthetic data generation using generative models has gained significant attention. In this challenge, participants are tasked first with building zero-shot TTS systems to augment personalized data. Subsequently, PSE systems are asked to be trained with this augmented personalized dataset. Through this challenge, we aim to investigate how the quality of augmented data generated by zero-shot TTS models affects PSE model performance. We also provide baseline experiments using open-source zero-shot TTS models to encourage participation and benchmark advancements. Our baseline code implementation and checkpoints are available online.

Problem

Research questions and friction points this paper is trying to address.

Zero-shot Text-to-Speech

Personalized Voice Generation

Privacy-preserving Data Collection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot TTS

Personalized voice enhancement

Synthetic data generation

🔎 Similar Papers

No similar papers found.