M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

📅 2026-06-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the degraded modality quality and audio-visual asynchrony that viewpoint variation, audio distortion, and visual occlusion cause in real-world scenarios. The authors propose M2S-AVSR, a modality-aware, multi-view self-supervised representation learning framework whose multi-view encoder learns view-invariant visual speech representations. A modality-aware module explicitly models modality quality and cross-modal synchrony to drive fine-grained fusion during decoding. The study also presents AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments. The method achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation on LRS3, sets a new state of the art on the MISP2021-AVSR test set, and achieves the best result in outdoor scenes of AISHELL8-RealScene.
📝 Abstract
Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, but real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.
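For readers unfamiliar with how a "relative improvement" figure such as the 29.4% above is usually computed, the sketch below shows the standard relative error-rate reduction. It assumes the underlying metric is an error rate where lower is better (for example word error rate); the abstract does not name the metric, and the function name `relative_improvement` and the example values are hypothetical, not taken from the paper.

```python
def relative_improvement(baseline_error: float, system_error: float) -> float:
    """Return the relative error-rate reduction, as a percentage.

    Assumes both inputs are error rates on the same scale (lower is better).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Hypothetical error rates for illustration only; NOT results from the paper.
example = relative_improvement(baseline_error=10.0, system_error=8.0)
print(f"{example:.1f}% relative improvement")  # prints "20.0% relative improvement"
```

Note that a relative figure depends on the baseline: the same absolute gain of 2 points is a 20% relative improvement from a 10% baseline but only 4% from a 50% baseline.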
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Speech Recognition
modality quality
audio-visual asynchrony
viewpoint variation
visual occlusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view representation
modality-aware fusion
self-supervised learning
audio-visual speech recognition
real-world robustness