AI Summary
This work addresses the fragility of existing speaker-aware speech enhancement methods, which either require clean enrollment audio or condition on speaker embeddings extracted from noisy speech that degrade under acoustic noise and domain shift. The authors propose G-MaP-SE, a prior-matching framework that needs no clean reference audio at inference time. It first builds a Gaussian Mixture Model (GMM) prior over clean-speech speaker embeddings, then refines the noisy embedding by matching it to this prior. The matched embedding is injected into a time-frequency enhancement backbone through a lightweight gated fusion module to guide speaker-aware denoising. On the VoiceBank+DEMAND and DNS Challenge 2020 datasets, prior matching consistently outperforms conditioning on noisy embeddings and substantially narrows the gap to an oracle upper bound that conditions on clean embeddings.
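The summary does not spell out the matching rule, so the following is only a minimal sketch of one plausible reading: a noisy embedding is scored against a diagonal-covariance GMM and replaced by the responsibility-weighted average of the component means. The function names, the diagonal-covariance choice, the toy parameters, and the weighted-mean rule itself are all assumptions made for illustration and may differ from what the authors implement.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def responsibilities(x, weights, means, variances):
    """Posterior probability of each GMM component given x."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def match_to_prior(x, weights, means, variances):
    """Hypothetical matching rule: responsibility-weighted mean of the components."""
    resp = responsibilities(x, weights, means, variances)
    return [
        sum(r * m[d] for r, m in zip(resp, means))
        for d in range(len(x))
    ]

# Toy two-component prior over 2-D "clean" embeddings (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

# A perturbed embedding that lies near the second component.
noisy = [3.5, 4.4]
matched = match_to_prior(noisy, weights, means, variances)
```

In this toy setup the matched embedding is pulled almost entirely onto the second component mean, which is the intuition behind replacing a noise-corrupted embedding with one drawn from the clean-speech prior. In practice the GMM parameters would be fitted on embeddings extracted from clean training speech.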
Abstract
Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.
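The abstract names a lightweight gated fusion module but gives no equations for it. As a rough illustration of the general idea, the sketch below fuses one backbone feature vector with a speaker embedding using an assumed form: the embedding is linearly projected to the feature dimension, a sigmoid gate is computed from the concatenated feature and projection, and the gated projection is added to the feature. The weight shapes, the residual form, and all names are hypothetical and not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def gated_fusion(feature, embedding, w_proj, w_gate, b_gate):
    """Fuse one feature vector with a conditioning embedding.

    Assumed form: out = feature + gate * proj, where
    proj = w_proj @ embedding and
    gate = sigmoid(w_gate @ [feature; proj] + b_gate).
    """
    proj = matvec(w_proj, embedding)
    gate_logits = matvec(w_gate, feature + proj)  # list concatenation
    gate = [sigmoid(z + b) for z, b in zip(gate_logits, b_gate)]
    return [f + g * p for f, g, p in zip(feature, gate, proj)]

# Toy sizes: 2-D backbone feature, 3-D speaker embedding (illustrative only).
feature = [0.2, -0.1]
embedding = [1.0, 0.0, -1.0]
w_proj = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]  # 2x3 projection
w_gate = [[0.0] * 4, [0.0] * 4]              # 2x4, zeroed so the bias sets the gate

# A strongly negative bias closes the gate; a strongly positive bias opens it.
closed = gated_fusion(feature, embedding, w_proj, w_gate, [-50.0, -50.0])
opened = gated_fusion(feature, embedding, w_proj, w_gate, [50.0, 50.0])
```

With the gate closed the output falls back to the unconditioned feature, and with the gate open the full projected embedding is added. A learned gate would sit between these extremes and let the backbone decide how much to rely on the speaker cue.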