DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

📅 2024-10-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address three key bottlenecks in diffusion-based text-to-speech (TTS): slow inference, degraded audio quality after distillation, and the inability to optimize perceptual metrics end to end, this paper proposes the first zero-shot TTS framework enabling joint differentiable optimization of auditory metrics, namely CTC-based phoneme alignment and speaker verification (SV). Methodologically, the authors design a lightweight knowledge-distilled diffusion architecture that enables gradient flow through all modules. Crucially, they integrate CTC loss and SV loss directly into the end-to-end training objective, replacing non-differentiable components and iterative sampling procedures. Experimental results demonstrate that the distilled student model significantly outperforms the teacher model across naturalness, intelligibility, and speaker similarity (human evaluation, *p* < 0.01), while achieving a 100x to 1000x inference speedup.
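The summary's core idea, folding differentiable perceptual metrics (a CTC intelligibility loss and an SV speaker-similarity loss) into one training objective, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name `combined_metric_loss`, the weight `lambda_sv`, and the embedding shapes are all assumptions for the example.

```python
import torch
import torch.nn.functional as F

def combined_metric_loss(log_probs, target_phonemes, input_lengths, target_lengths,
                         student_emb, reference_emb, lambda_sv=1.0):
    """Hypothetical sketch of direct metric optimization: a differentiable CTC
    loss (intelligibility) plus a speaker-verification embedding-similarity
    loss, both backpropagated end to end to the student model."""
    # CTC loss aligns frame-level phoneme log-probabilities from the
    # synthesized speech with the target phoneme sequence.
    # log_probs shape: [time, batch, num_phoneme_classes], log-softmaxed.
    ctc = F.ctc_loss(log_probs, target_phonemes, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # SV loss pulls the synthesized utterance's speaker embedding toward the
    # reference speaker's embedding (1 - cosine similarity).
    sv = 1.0 - F.cosine_similarity(student_emb, reference_emb, dim=-1).mean()
    return ctc + lambda_sv * sv
```

Because both terms are differentiable, gradients from the metric losses can reach every module of the distilled student, which is what the iterative sampling of the original diffusion teacher prevented.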

๐Ÿ“ Abstract
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization. The audio samples are available at https://dmospeech.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Optimizes speech synthesis via distilled diffusion model
Reduces computational cost without quality degradation
Enables end-to-end optimization with perceptual metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilled diffusion model for TTS
End-to-end differentiable metric optimization
CTC and SV loss integration