Array2BR: An End-to-End Noise-immune Binaural Audio Synthesis from Microphone-array Signals

๐Ÿ“… 2024-10-08
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing binaural audio synthesis methods for remote conferencing are limited by either single-microphone setups lacking spatial cues or microphone arrays relying on high-accuracy direction-of-arrival (DOA) estimation. This paper proposes the first noise-robust, end-to-end array-to-binaural (Array2BR) mapping framework that directly synthesizes high-fidelity, low-noise binaural audio from multi-channel array signals. Our approach jointly models interaural time/level differences (ITD/ILD) and noise suppression in a unified mannerโ€”without requiring source separation or post-processing. It integrates beamforming priors with binaural transfer function constraints, employs time-frequency joint deep modeling, and optimizes via multi-scale loss functions. Experiments demonstrate significant improvements over state-of-the-art methods: +1.22 PESQ, +4.7% STOI, and +0.8 MOS; achieves 18.2 dB noise reduction; and reduces ITD/ILD estimation errors by 37%.

Technology Category

Application Category

๐Ÿ“ Abstract
Telepresence technology aims to provide an immersive virtual presence for remote conference applications, and it is extremely important to synthesize high-quality binaural audio signals for this aim. Because the ambient noise is often inevitable in practical application scenarios, it is highly desired that binaural audio signals without noise can be obtained from microphone-array signals directly. For this purpose, this paper proposes a new end-to-end noise-immune binaural audio synthesis framework from microphone-array signals, abbreviated as Array2BR, and experimental results show that binaural cues can be correctly mapped and noise can be well suppressed simultaneously using the proposed framework. Compared with existing methods, the proposed method achieved better performance in terms of both objective and subjective metric scores.
Problem

Research questions and friction points this paper is trying to address.

Enhancing remote conferencing with clear and spatial audio
Overcoming limitations in existing speaker extraction methods
Improving accuracy of spatial rendering in binaural speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end deep learning framework for speech
Unifies extraction, noise suppression, binaural rendering
Magnitude-weighted ILD loss improves spatial accuracy
Cheng Chi
Cheng Chi
Columbia University, Stanford University
robotics
X
Xiaoyu Li
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; Communication University of China, Beijing, China
A
Andong Li
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Y
Yuxuan Ke
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
X
Xiaodong Li
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
C
C. Zheng
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China