🤖 AI Summary
This work addresses the challenge of speech enhancement under low signal-to-noise ratio (SNR) conditions by proposing a novel approach based on an absorbing discrete diffusion model. The method models the conditional distribution of clean speech codes in the expressive latent space of a neural audio codec in a non-autoregressive manner and, for the first time, applies an absorbing discrete diffusion mechanism to the speech enhancement task. To this end, the authors design the RQDiT architecture, which combines residual quantization (RQ) with a diffusion Transformer to capture the hierarchical structure of RQ-encoded representations. Experimental results demonstrate that the proposed method achieves competitive non-intrusive objective scores on two benchmark datasets, with particular strength in low-SNR scenarios and with very few sampling steps.
📝 Abstract
Inspired by recent developments in neural speech coding and diffusion-based language modeling, we tackle speech enhancement by modeling the conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. The proposed approach, which we call ADDSE, leverages both the expressive latent space of neural audio codecs and the non-autoregressive sampling procedure of diffusion models. To efficiently model the hierarchical structure of residual vector quantization codes, we propose RQDiT, which combines techniques from RQ-Transformer and diffusion Transformers for non-autoregressive modeling. Results show competitive performance in terms of non-intrusive objective metrics on two datasets, especially at low signal-to-noise ratios and with few sampling steps. Code and audio examples are available online.
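The absorbing discrete diffusion idea at the core of the abstract can be illustrated with a minimal sketch. This is an illustrative assumption of the general mechanism, not the paper's actual implementation: the `MASK` token id, the linear `t/T` masking schedule, and all function names below are hypothetical. Each discrete code is independently absorbed into a special mask state with a probability that grows with the timestep, reaching the fully masked state at `t = T`; a learned reverse process (here replaced by a toy predictor) then fills masked positions back in, which is what enables non-autoregressive sampling in few steps.

```python
import random

MASK = -1  # hypothetical id for the absorbing "mask" state (illustrative)

def forward_absorb(codes, t, T, rng=random):
    """Forward process: each code is independently absorbed into MASK
    with probability t/T (no masking at t=0, fully masked at t=T)."""
    p = t / T
    return [MASK if rng.random() < p else c for c in codes]

def reverse_step(codes, predict):
    """One toy reverse step: fill every masked position via a predictor
    (a stand-in for the denoising network's per-position output)."""
    return [predict(i) if c == MASK else c for i, c in enumerate(codes)]

# Example: corrupt a short code sequence, then restore it.
clean = [12, 7, 33, 5]
fully_masked = forward_absorb(clean, t=4, T=4)            # all positions MASK
restored = reverse_step(fully_masked, predict=lambda i: clean[i])
```

In the paper's setting the predictor would be the RQDiT network conditioned on the noisy speech codes; in practice the reverse process also runs over multiple steps, unmasking only a subset of positions per step.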