Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address degraded speech enhancement performance in complex acoustic environments with multiple concurrent interfering sources and reverberation, this paper proposes a cascaded joint-modeling paradigm: "separation first, then dereverberation." Methodologically, the authors introduce a two-stage deep network that fuses audio-visual features: the first stage leverages visual cues to guide target speaker separation, while the second stage performs targeted dereverberation in the time-frequency domain using attention mechanisms. The framework supports end-to-end joint optimization and is modular and extensible. Evaluated in the AVSEC-4 challenge, the method ranks first on all three objective metrics (STOI, ESTOI, PESQ) and attains the highest subjective MOS score, significantly improving speech clarity, naturalness, and intelligibility in realistic scenarios and demonstrating both the effectiveness and practicality of multimodal modeling of speech degradation.
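The cascade described above can be illustrated with a minimal sketch. This is not the authors' implementation: the stage internals (the visual-cue-driven mask and the dereverberation filter) are hypothetical stand-ins that only show the ordering of the two stages, with separation applied before dereverberation so the second stage operates on a single speaker's signal.

```python
import numpy as np

def separate(mixture: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """Stage 1 (sketch): visual cues drive a soft mask over the mixture.
    In the paper this is a learned audio-visual separation network."""
    mask = 1.0 / (1.0 + np.exp(-visual))  # placeholder for a learned mask
    return mixture * mask

def dereverberate(speech: np.ndarray) -> np.ndarray:
    """Stage 2 (sketch): suppress reverberant tails.
    In the paper this is an attention-based time-frequency module."""
    delayed = np.concatenate(([0.0], speech[:-1]))  # crude one-sample echo model
    return speech - 0.5 * delayed

def avse_pipeline(mixture: np.ndarray, visual: np.ndarray) -> np.ndarray:
    # "Separation before dereverberation": stage 2 sees one speaker only.
    return dereverberate(separate(mixture, visual))

# Dummy inputs: 1 s of 16 kHz mixture audio and a same-length visual embedding.
x = np.random.randn(16000)
v = np.zeros(16000)
y = avse_pipeline(x, v)
print(y.shape)  # output has the same length as the input mixture
```

The point of the ordering is that dereverberation is easier once interfering speakers are removed; running the stages in reverse would force the dereverberation module to model the reverberation of every source in the mixture at once.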

📝 Abstract
Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.
Problem

Research questions and friction points this paper is trying to address.

Enhances speech in complex acoustic environments
Separates target speech from interfering sounds
Reduces reverberation effects in audio-visual scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Separation before dereverberation pipeline design
Joint modeling for complex acoustic environments
Extendable audio-visual speech enhancement network
Jiarong Du
School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Zhan Jin
School of Computer Science, Wuhan University, Wuhan, China
Peijun Yang
School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Juan Liu
Wuhan University (Data Mining; Artificial Intelligence in Bioinformatics; Biomedicine)
Zhuo Li
Hardware Engineering System, OPPO, Beijing, China
Xin Liu
Hardware Engineering System, OPPO, Beijing, China
Ming Li
School of Artificial Intelligence, Wuhan University, Wuhan, China