🤖 AI Summary
Environmental mismatch in real-world scenarios significantly degrades speaker verification performance. To address this, we propose an unsupervised speaker embedding enhancement method based on diffusion models—requiring no speaker labels and operating entirely decoupled from upstream speaker recognition systems. Our approach leverages the forward noise-corruption and reverse denoising processes of diffusion models to enhance the robustness of pre-extracted speaker embeddings, while incorporating unsupervised embedding mapping learning to adapt to mismatched acoustic conditions. To the best of our knowledge, this is the first work to apply diffusion models to speaker embedding enhancement. Evaluated on standard environmental mismatch benchmarks, our method achieves up to a 19.6% improvement in identification accuracy without compromising performance under matched conditions, demonstrating both effectiveness and seamless compatibility with existing speaker verification pipelines.
📝 Abstract
A primary challenge in deploying speaker recognition systems in real-world applications is the performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted by a pre-trained speaker recognition model and generates refined embeddings. During training, the forward process of a diffusion model progressively adds Gaussian noise to both clean and noisy speaker embeddings (extracted from clean and noisy speech, respectively), and the reverse process then reconstructs them into clean embeddings. At inference time, all embeddings are regenerated through this diffusion process. Our method requires neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environmental mismatch scenarios show that our method improves recognition accuracy by up to 19.6% over baseline models while retaining performance in conventional scenarios. Our code is available at https://github.com/kaistmm/seed-pytorch
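The training scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embedding dimension, noise schedule, and random stand-ins for encoder outputs are all assumptions, and the denoiser network itself is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100                                  # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)      # cumulative product \bar{alpha}_t

def forward_diffuse(x0, t, rng):
    """Corrupt an embedding x0 with Gaussian noise at step t (forward process)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Stand-ins for a speaker encoder's outputs on the same utterance:
# one embedding from clean speech, one from a noisy version of it.
clean_emb = rng.standard_normal(192)
noisy_emb = clean_emb + 0.3 * rng.standard_normal(192)

# Both embeddings are corrupted by the forward process; the reverse
# (denoising) network is trained to reconstruct the *clean* embedding
# from either corrupted input, so that at inference time regenerating
# any embedding pulls it toward the clean-condition manifold.
t = 50
xt_clean, _ = forward_diffuse(clean_emb, t, rng)
xt_noisy, _ = forward_diffuse(noisy_emb, t, rng)

# Training target for both branches is the same clean embedding, e.g.:
# loss = ||denoiser(xt_clean, t) - clean_emb||^2
#      + ||denoiser(xt_noisy, t) - clean_emb||^2
```

Because the method operates purely on pre-extracted embeddings, this sketch touches nothing inside the speaker encoder, which is what makes the approach label-free and pipeline-agnostic.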