Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

📅 2025-10-28

🤖 AI Summary

To address degraded speech enhancement performance in complex acoustic environments where multiple interfering sources and reverberation occur together, this paper proposes a cascaded "separation first, then dereverberation" pipeline. The method is a two-stage deep network that fuses audio and visual features: the first stage uses visual cues to guide target-speaker separation, and the second stage dereverberates the separated speech in the time-frequency domain using attention mechanisms. The framework supports end-to-end joint optimization and is modular enough to extend to other AVSE networks. Evaluated in the AVSEC-4 challenge, the system achieved excellent results on the three objective leaderboard metrics (STOI, PESQ, ESTOI) and took first place in the human subjective listening test.

📝 Abstract

Audio-visual speech enhancement (AVSE) uses auxiliary visual information to extract a target speaker's speech from mixed audio. Real-world acoustic environments are often complex, combining various interfering sounds with reverberation. Most previous methods struggle under such conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC-4) explores new approaches to speech processing in complex multimodal environments. We validated our system in AVSEC-4: it achieved excellent results on the three objective metrics on the competition leaderboard and ultimately secured first place in the human subjective listening test.

Problem

Research questions and friction points this paper is trying to address.

Enhances speech in complex acoustic environments

Separates target speech from interfering sounds

Reduces reverberation effects in audio-visual scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Separation before dereverberation pipeline design

Joint modeling for complex acoustic environments

Pipeline extensible to other audio-visual speech enhancement networks
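
The "separation before dereverberation" ordering can be illustrated with a minimal sketch. This is not the authors' network: the class `SeparationThenDereverb`, the stage signatures, and both toy stages (`toy_separator`, `make_toy_dereverb`) are hypothetical stand-ins chosen only to show how two swappable stages might be cascaded. Here the "visual features" are assumed to be one activity score per audio sample, and the "reverberation" is assumed to be a single known echo.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Waveform = List[float]

# Assumed stage signatures for illustration only (not taken from the paper).
Separator = Callable[[Sequence[float], Sequence[float]], Waveform]
Dereverberator = Callable[[Sequence[float]], Waveform]


@dataclass
class SeparationThenDereverb:
    """Cascade: visually guided separation first, dereverberation second."""

    separate: Separator
    dereverb: Dereverberator

    def __call__(self, mixture: Sequence[float], visual: Sequence[float]) -> Waveform:
        # Stage 1: extract the (still reverberant) target speech using visual cues.
        reverberant_target = self.separate(mixture, visual)
        # Stage 2: dereverberate only the separated target.
        return self.dereverb(reverberant_target)


def toy_separator(mixture: Sequence[float], visual: Sequence[float]) -> Waveform:
    """Placeholder separator: gate each sample by a visual activity score in [0, 1]."""
    if len(mixture) != len(visual):
        raise ValueError("mixture and visual features must have the same length")
    return [x * min(max(v, 0.0), 1.0) for x, v in zip(mixture, visual)]


def make_toy_dereverb(delay: int, gain: float) -> Dereverberator:
    """Placeholder dereverberator for a single echo y[n] = x[n] + gain * x[n - delay].

    It applies the recursive inverse filter x[n] = y[n] - gain * x[n - delay].
    """
    if delay < 1:
        raise ValueError("delay must be at least one sample")

    def dereverb(signal: Sequence[float]) -> Waveform:
        out: Waveform = []
        for n, y in enumerate(signal):
            echo = gain * out[n - delay] if n >= delay else 0.0
            out.append(y - echo)
        return out

    return dereverb


if __name__ == "__main__":
    pipeline = SeparationThenDereverb(toy_separator, make_toy_dereverb(delay=2, gain=0.5))
    # An impulse followed by its echo two samples later; the speaker is visually active throughout.
    print(pipeline([1.0, 0.0, 0.5, 0.0], [1.0, 1.0, 1.0, 1.0]))  # prints [1.0, 0.0, 0.0, 0.0]
```

Because each stage is just a callable, either one could in principle be replaced by a different model without touching the other, which is the kind of extensibility the abstract attributes to the pipeline design.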
