Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation

πŸ“… 2024-09-14
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Diffusion-based zero-shot text-to-speech (TTS) suffers from unstable formants and phoneme mispronunciations because the diffusion process lacks articulatory physical constraints. To address this, we propose StableForm-TTS, the first diffusion TTS framework to incorporate source-filter theory, enforcing biophysically grounded articulatory constraints for stable formant modeling. Methodologically, we design an acoustically decoupled conditional diffusion model, introduce a formant-aware loss, and employ hierarchical time-frequency feature guidance; we further adopt a speaker-agnostic zero-shot adaptation strategy. Experiments show that StableForm-TTS improves phoneme accuracy by 12.6% on unseen speakers, raises naturalness (MOS) by 0.42, and maintains high speaker similarity. Moreover, performance scales consistently with both data volume and model size, confirming the approach's robustness and extensibility.
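As general background on the source-filter view the summary invokes (this is standard speech-production theory, not the paper's method): speech is modeled as a glottal source signal shaped by a vocal-tract filter whose resonance peaks are the formants. A minimal NumPy sketch, exciting an impulse-train source with a cascade of two second-order all-pole resonators at illustrative (hypothetical) formant frequencies:

```python
import numpy as np

def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Denominator coefficients of a 2nd-order all-pole resonator."""
    r = np.exp(-np.pi * bandwidth_hz / fs)   # pole radius set by bandwidth
    theta = 2 * np.pi * freq_hz / fs         # pole angle set by center frequency
    return np.array([1.0, -2 * r * np.cos(theta), r * r])

def apply_all_pole(x, a):
    """Direct-form IIR: y[n] = x[n] - a[1]*y[n-1] - a[2]*y[n-2]."""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a[1] * y[n - 1]
        if n >= 2:
            y[n] -= a[2] * y[n - 2]
    return y

fs = 16000
# Source: impulse train at a 100 Hz pitch, a crude stand-in for glottal excitation.
source = np.zeros(fs // 10)                  # 100 ms of signal
source[:: fs // 100] = 1.0
# Filter: cascade of two resonators at hypothetical /a/-like formant values (F1, F2).
speech = source
for f, bw in [(700, 130), (1220, 70)]:
    speech = apply_all_pole(speech, resonator_coeffs(f, bw, fs))

print(speech.shape)  # (1600,)
```

Unstable formant trajectories in this picture correspond to the filter's resonances drifting unphysically, which is the failure mode the paper's formant-stabilizing constraints target.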

πŸ“ Abstract
Diffusion models have achieved remarkable success in text-to-speech (TTS), even in zero-shot scenarios. Recent efforts aim to address the trade-off between inference speed and sound quality, often considered the primary drawback of diffusion models. However, we find a critical mispronunciation issue is being overlooked. Our preliminary study reveals the unstable pronunciation resulting from the diffusion process. Based on this observation, we introduce StableForm-TTS, a novel zero-shot speech synthesis framework designed to produce robust pronunciation while maintaining the advantages of diffusion modeling. By pioneering the adoption of source-filter theory in diffusion TTS, we propose an elaborate architecture for stable formant generation. Experimental results on unseen speakers show that our model outperforms the state-of-the-art method in terms of pronunciation accuracy and naturalness, with comparable speaker similarity. Moreover, our model demonstrates effective scalability as both data and model sizes increase. Audio samples are available online: https://deepbrainai-research.github.io/stableformtts/.
Problem

Research questions and friction points this paper is trying to address.

Text-to-Speech
Stability and Accuracy
Diffusion Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

StableForm-TTS
Enhanced Acoustic Feature Stability
Diffusion Model Optimization
πŸ”Ž Similar Papers
No similar papers found.
Changjin Han
DeepBrain AI, Seoul, South Korea
Seokgi Lee
DeepBrain AI, Seoul, South Korea
Gyuhyeon Nam
DeepBrain AI, Seoul, South Korea
Gyeongsu Chae
DeepBrain AI, Seoul, South Korea