🤖 AI Summary
To address the degradation of speech enhancement in complex acoustic environments with multiple concurrent interfering sources and reverberation, this paper proposes a cascaded “separation first, then dereverberation” pipeline. The method is a two-stage deep network that fuses audio-visual features: the first stage uses visual cues to guide target-speaker separation, and the second stage dereverberates the separated speech in the time-frequency domain using attention mechanisms. The framework supports end-to-end joint optimization and is modular, so the pipeline can be extended to other AVSE networks. Evaluated in the AVSEC-4 challenge, the system achieved excellent results on the three objective leaderboard metrics (STOI, PESQ, ESTOI) and took first place in the human subjective listening test, indicating improved perceptual quality of the extracted speech in realistic conditions.
📝 Abstract
Audio-visual speech enhancement (AVSE) uses auxiliary visual information to extract a target speaker's speech from mixed audio. Real-world acoustic environments are often complex, with various interfering sounds and reverberation. Most previous methods struggle under such conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated our system in AVSEC-4, where it achieved excellent results on the three objective metrics of the competition leaderboard and ultimately secured first place in the human subjective listening test.
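The "separation before dereverberation" ordering can be illustrated with a minimal sketch. This is not the authors' network: the class name `SeparateThenDereverb`, the type aliases, and both stub stages are hypothetical placeholders, chosen only to show how a cascade with swappable stages might be wired so that the dereverberation stage always receives the already-separated target speech.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical type aliases, for illustration only.
Audio = List[float]              # mono waveform samples
VisualFrames = Sequence[object]  # per-frame visual features of the target speaker


@dataclass
class SeparateThenDereverb:
    """Cascade: visually guided separation first, dereverberation second."""
    separator: Callable[[Audio, VisualFrames], Audio]
    dereverberator: Callable[[Audio], Audio]

    def __call__(self, mixture: Audio, frames: VisualFrames) -> Audio:
        # Stage 1: extract the target speaker from the mixture using visual cues.
        separated = self.separator(mixture, frames)
        # Stage 2: dereverberate only the separated target speech.
        return self.dereverberator(separated)


# Stub stages (assumptions): they stand in for trained models and only
# record the call order so the cascade ordering can be checked.
call_order: List[str] = []


def stub_separator(mixture: Audio, frames: VisualFrames) -> Audio:
    call_order.append("separate")
    return [0.5 * s for s in mixture]  # placeholder for a learned separation model


def stub_dereverberator(speech: Audio) -> Audio:
    call_order.append("dereverb")
    return list(speech)  # placeholder for a learned dereverberation model


pipeline = SeparateThenDereverb(stub_separator, stub_dereverberator)
enhanced = pipeline([0.2, -0.4, 0.6], frames=["f0", "f1", "f2"])
```

Because each stage is just a callable, a different AVSE separator could in principle be dropped into the first slot without touching the second stage, which is one way to read the extensibility claim above.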