Robust One-step Speech Enhancement via Consistency Distillation

๐Ÿ“… 2025-07-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Diffusion-based speech enhancement suffers from slow multi-step sampling, hindering real-time deployment; while consistency distillation enables single-step inference, student models often inherit trajectory biases from teachers, compromising robustness. This paper proposes ROSE-CD: a robust single-step enhancement framework that breaks reliance on fixed sampling paths via randomized trajectory learning and improves denoising fidelity through two time-domain auxiliary losses, enabling the student to surpass the teacher in performance. Evaluated on VoiceBank-DEMAND, ROSE-CD achieves state-of-the-art (SOTA) results, accelerates inference by 54× over the 30-step teacher, and demonstrates strong generalization to out-of-domain and real-world noisy conditions. Its core contribution is the first integration of trajectory randomization and time-domain supervision into consistency distillation, significantly improving both the robustness and the accuracy of single-step diffusion models.

๐Ÿ“ Abstract
Diffusion models have shown strong performance in speech enhancement, but their real-time applicability has been limited by multi-step iterative sampling. Consistency distillation has recently emerged as a promising alternative by distilling a one-step consistency model from a multi-step diffusion-based teacher model. However, distilled consistency models are inherently biased towards the sampling trajectory of the teacher model, making them less robust to noise and prone to inheriting inaccuracies from the teacher model. To address this limitation, we propose ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation, a novel approach for distilling a one-step consistency model. Specifically, we introduce a randomized learning trajectory to improve the model's robustness to noise. Furthermore, we jointly optimize the one-step model with two time-domain auxiliary losses, enabling it to recover from teacher-induced errors and surpass the teacher model in overall performance. This is the first pure one-step consistency distillation model for diffusion-based speech enhancement, achieving 54 times faster inference speed and superior performance compared to its 30-step teacher model. Experiments on the VoiceBank-DEMAND dataset demonstrate that the proposed model achieves state-of-the-art performance in terms of speech quality. Moreover, its generalization ability is validated on both an out-of-domain dataset and real-world noisy recordings.
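The abstract's two ideas, a randomized learning trajectory instead of the teacher's fixed sampling path, plus time-domain auxiliary losses that let the student recover from teacher errors, can be illustrated with a toy training step. This is a minimal sketch, not the paper's implementation: the `student` network, the noise parameterization, and the loss weighting are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def student(x_t, t, w):
    # Hypothetical one-step consistency model; a trivial scaling map
    # stands in for the real enhancement network.
    return w * x_t / (1.0 + t)

def distillation_step(clean, w, w_ema, sigma_max=1.0):
    """One toy training step combining the paper's two ideas."""
    # Randomized trajectory: draw the noise level uniformly instead of
    # following the teacher's fixed 30-step sampling schedule.
    t = rng.uniform(0.1, sigma_max)
    dt = 0.05 * t  # a nearby earlier point on the same trajectory
    noise = rng.standard_normal(clean.shape)
    x_t = clean + t * noise          # noisy speech at level t
    x_s = clean + (t - dt) * noise   # same sample at the earlier level

    # Consistency loss: the student's outputs at adjacent points of the
    # trajectory should agree; the earlier point is scored by an EMA
    # (exponential-moving-average) copy of the student's weights.
    pred_t = student(x_t, t, w)
    pred_s = student(x_s, t - dt, w_ema)
    l_consistency = np.mean((pred_t - pred_s) ** 2)

    # Time-domain auxiliary loss: anchor the one-step output directly to
    # the clean waveform, so the student is not limited by teacher errors.
    l_time = np.mean(np.abs(pred_t - clean))
    return l_consistency + l_time
```

Because the time-domain loss compares against the clean signal rather than the teacher's output, the student has a supervision signal that is independent of the teacher's trajectory, which is how, per the abstract, it can surpass the 30-step teacher.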
Problem

Research questions and friction points this paper is trying to address.

Eliminate slow multi-step iterative sampling in diffusion models
Reduce trajectory bias and noise sensitivity in distilled consistency models
Improve the robustness and accuracy of single-step speech enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency distillation for one-step speech enhancement
Randomized learning trajectory improves noise robustness
Joint optimization with time-domain auxiliary losses
๐Ÿ”Ž Similar Papers
No similar papers found.