AI Summary
Existing binaural audio synthesis methods for remote conferencing are limited: single-microphone setups lack spatial cues, and microphone-array approaches rely on high-accuracy direction-of-arrival (DOA) estimation. This paper proposes the first noise-robust, end-to-end array-to-binaural (Array2BR) mapping framework that directly synthesizes high-fidelity, low-noise binaural audio from multi-channel array signals. The approach jointly models interaural time/level differences (ITD/ILD) and noise suppression in a unified manner, without requiring source separation or post-processing. It integrates beamforming priors with binaural transfer function constraints, employs joint time-frequency deep modeling, and is optimized with multi-scale loss functions. Experiments demonstrate significant improvements over state-of-the-art methods: +1.22 PESQ, +4.7% STOI, and +0.8 MOS, together with 18.2 dB of noise reduction and a 37% reduction in ITD/ILD estimation errors.
Abstract
Telepresence technology aims to provide an immersive virtual presence for remote conferencing, and synthesizing high-quality binaural audio is essential to that aim. Because ambient noise is often unavoidable in practical scenarios, it is highly desirable to obtain noise-free binaural audio signals directly from microphone-array signals. To this end, this paper proposes a new end-to-end, noise-immune framework for binaural audio synthesis from microphone-array signals, abbreviated as Array2BR. Experimental results show that the proposed framework maps binaural cues correctly while suppressing noise effectively. Compared with existing methods, the proposed method achieves better objective and subjective metric scores.
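For readers unfamiliar with the binaural cues mentioned above, the following minimal sketch shows one common way to measure a broadband ITD and ILD from a pair of ear signals: the ITD as the lag of the cross-correlation peak, and the ILD as the energy ratio in decibels. It is an illustration only, not the paper's evaluation procedure; the function name `estimate_itd_ild`, the sampling rate, and the synthetic test signal are assumptions introduced here.

```python
import numpy as np


def estimate_itd_ild(left, right, fs):
    """Return (itd_seconds, ild_db) for two equal-length ear signals.

    Hypothetical helper for illustration; not taken from the paper.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)

    # Full cross-correlation; the peak position gives the lag in samples.
    xcorr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(xcorr)) - (len(right) - 1)
    itd = lag / fs  # positive value: the left signal lags the right one

    # Broadband level difference as an energy ratio in dB (left re. right).
    eps = 1e-12  # guards against log of zero for silent inputs
    ild = 10.0 * np.log10((np.sum(left**2) + eps) / (np.sum(right**2) + eps))
    return itd, ild


# Synthetic example (assumed values): the left ear receives the same noise
# burst 8 samples later and at half the amplitude of the right ear.
fs = 16000
delay = 8
rng = np.random.default_rng(0)
src = rng.standard_normal(1024)
right = np.concatenate([src, np.zeros(delay)])
left = 0.5 * np.concatenate([np.zeros(delay), src])

itd, ild = estimate_itd_ild(left, right, fs)
print(f"ITD = {itd * 1e3:.3f} ms, ILD = {ild:.2f} dB")
```

In practice such cues are usually analyzed per frequency band rather than broadband, so this sketch should be read as a conceptual aid, not as a substitute for the metrics reported in the paper.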