Unsupervised Blind Speech Separation with a Diffusion Prior

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Blind speech separation (BSS) confronts the fundamental challenge of unknown microphone array geometry and room impulse responses (RIRs), making it a canonical blind acoustic inverse problem. This paper introduces ArrayDPS, the first framework to extend diffusion posterior sampling (DPS) to blind BSS, requiring only the observed mixtures and a single-speaker diffusion model as a prior, with no assumptions about array configuration or RIR statistics. Its core innovation is a learnable likelihood approximation that jointly estimates the relative transfer functions between microphones and the room acoustics, enabling array-agnostic, unsupervised, and generative source separation. ArrayDPS consistently outperforms existing unsupervised baselines in signal-to-distortion ratio (SDR), achieving performance on par with supervised methods. Crucially, it generalizes to real-world recordings without fine-tuning, validating its practical applicability in realistic acoustic environments.

📝 Abstract
Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos are provided at: https://arraydps.github.io/ArrayDPSDemo/.
Problem

Research questions and friction points this paper is trying to address.

Blindly separate multiple speech sources from microphone-array mixtures
Solve BSS without knowing the microphone array geometry or room acoustics
Approximate the intractable likelihood when the mixing process (RIRs) is unknown
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends diffusion posterior sampling (DPS) to blind speech separation
Approximates the likelihood, room acoustics, and relative transfer functions via a separate optimization problem
Requires only a single-speaker diffusion model prior and the microphone mixtures
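The sampling loop described above can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration, not the paper's implementation: a toy shrinkage `denoiser` stands in for the trained single-speaker diffusion prior, and per-source scalar gains fitted by least squares (`estimate_filters`) stand in for the inner optimization over relative transfer functions. Each step denoises with the prior, re-estimates the unknown mixing, and nudges the sources toward explaining the observed mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x_t, sigma):
    """Stand-in for the single-speaker diffusion prior: a toy shrinkage
    denoiser (the real method uses a trained score network)."""
    return x_t / (1.0 + sigma**2)

def estimate_filters(sources, mixture):
    """Hypothetical stand-in for the inner optimization: least-squares
    fit of per-source scalar gains, playing the role of relative
    transfer function / room acoustics estimation."""
    gains, _, _, _ = np.linalg.lstsq(sources.T, mixture, rcond=None)
    return gains  # shape: (n_sources,)

def arraydps_step(x_t, sigma, mixture, step=0.5):
    """One posterior-sampling step: prior denoising plus a gradient step
    on the approximated likelihood ||mixture - gains @ x0_hat||^2."""
    x0_hat = denoiser(x_t, sigma)              # prior's clean estimate
    gains = estimate_filters(x0_hat, mixture)  # blind mixing estimate
    residual = mixture - gains @ x0_hat        # data misfit
    grad = np.outer(gains, residual)           # descent direction w.r.t. x0_hat
    return x0_hat + step * grad                # likelihood-guided update

# Toy demo: two 1-D "sources", one observed mixture channel.
s = rng.standard_normal((2, 64))
y = np.array([0.8, 0.5]) @ s                   # unknown mixing (never given to the sampler)
x = rng.standard_normal((2, 64))               # initialize sources from noise
for sigma in [1.0, 0.5, 0.25, 0.1]:            # coarse noise schedule
    x = arraydps_step(x, sigma, y)
```

The key structural point the sketch captures is that the likelihood term is never known in closed form: the mixing estimate is re-solved inside every sampling step from the current source iterates, so the prior and the mixing model refine each other jointly.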