Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

📅 2026-06-11

🤖 AI Summary

Real-world far-field speech enhancement is hindered by the scarcity of paired distorted and clean data and by the mismatch between simulated and real recordings. This work proposes Close-to-Distant microphone Projection (C2D projection), which uses simultaneously recorded close- and distant-microphone signals to generate clean reference signals aligned with the distant-microphone recordings. C2D projection estimates an optimal projection matrix, realized with a variant of the parametric multichannel Wiener filter (PMWF), that maps close-microphone inputs onto the distant-microphone recordings while denoising them. On the CHiME-6 dinner party ASR task under oracle diarization, a neural network trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) method when the GSS-enhanced output is supplied to the network as an auxiliary input.

📝 Abstract

Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME-6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.
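To make the idea of a projection matrix concrete, the following is a minimal sketch of a plain per-frequency least-squares (Wiener) projection of close-microphone STFT channels onto distant-microphone channels. It is not the PMWF variant used in the paper, whose exact formulation is not given here; the function name `project_close_to_distant`, the array layout `(frequency, channel, frame)`, and the regularizer `eps` are assumptions made for illustration only. The intuition it shows is that components of the distant recording that are uncorrelated with the close microphones (noise, other sources) are largely left out of the projected target.

```python
import numpy as np


def project_close_to_distant(close, distant, eps=1e-8):
    """Project close-mic STFTs onto distant-mic channels (illustrative sketch).

    close:   complex array of shape (F, C, T), close-microphone STFT
    distant: complex array of shape (F, D, T), distant-microphone STFT
    Returns a complex array of shape (F, D, T): the close-mic signal as
    "seen" at each distant microphone under a least-squares criterion.
    """
    _, n_close, n_frames = close.shape
    # Cross-covariance between distant and close channels, per frequency: (F, D, C)
    r_dc = np.einsum("fdt,fct->fdc", distant, close.conj()) / n_frames
    # Covariance of the close channels, per frequency: (F, C, C)
    r_cc = np.einsum("fct,fet->fce", close, close.conj()) / n_frames
    # Small diagonal loading keeps the inverse well conditioned (assumed value)
    r_cc = r_cc + eps * np.eye(n_close)
    # Least-squares projection matrix W = R_dc R_cc^{-1}, shape (F, D, C)
    w = r_dc @ np.linalg.inv(r_cc)
    # Apply the projection to every frame: (F, D, C) @ (F, C, T) -> (F, D, T)
    return w @ close


if __name__ == "__main__":
    # Synthetic example: one close mic, two distant mics, additive noise
    rng = np.random.default_rng(0)
    F, C, D, T = 4, 1, 2, 2000
    close = rng.standard_normal((F, C, T)) + 1j * rng.standard_normal((F, C, T))
    h = rng.standard_normal((F, D, C)) + 1j * rng.standard_normal((F, D, C))
    noise = rng.standard_normal((F, D, T)) + 1j * rng.standard_normal((F, D, T))
    target = project_close_to_distant(close, h @ close + noise)
    print(target.shape)
```

In this toy setting the projected target is aligned with the distant channels and stays close to the noise-free component `h @ close`, which is the kind of reference signal an SE network could be trained against.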

Problem

Research questions and friction points this paper is trying to address.

speech enhancement

distant speech

training data mismatch

real-world recordings

clean reference signals

Innovation

Methods, ideas, or system contributions that make the work stand out.

Close-to-Distant microphone Projection

Speech Enhancement

Real-world Data