🤖 AI Summary
This work addresses the degradation and asynchrony of audio-visual modalities in real-world scenarios caused by viewpoint variation, audio distortion, and visual occlusion. To tackle these challenges, the authors propose M2S-AVSR, a modality-aware, multi-view self-supervised representation learning framework that employs multi-view encoders to learn view-invariant visual speech representations. A modality-aware module jointly models modality quality and cross-modal synchrony, enabling fine-grained fusion. The study also presents AISHELL8-RealScene, the first large-scale, multi-scenario, multi-view Chinese audio-visual dataset captured in realistic environments. The proposed method achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation on LRS3, sets a new state of the art on the MISP2021-AVSR test set, and achieves the best result in the outdoor scenes of AISHELL8-RealScene.
📝 Abstract
Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, but real-world scenarios remain challenging because viewpoint variation, audio distortion, and visual occlusion degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.
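For readers unfamiliar with the term, the sketch below shows how a "relative improvement" figure such as the 29.4% above is usually computed. It assumes the underlying metric is an error rate where lower is better (for example word or character error rate), which the abstract does not state explicitly. The function name `relative_improvement` and the two example error rates are hypothetical, chosen only for illustration; they are not results from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the error reduction as a fraction of the baseline error.

    Assumes a lower-is-better metric such as word error rate.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates, NOT taken from the paper.
baseline = 10.0   # baseline system error rate, in percent
improved = 7.06   # proposed system error rate, in percent

# Format the fraction as a percentage with one decimal place.
print(f"{relative_improvement(baseline, improved):.1%}")
```

A relative figure depends on the baseline: the same absolute reduction of about 2.9 points would be a much smaller relative improvement against a baseline of 30.0 than against one of 10.0, so relative and absolute gains should not be compared directly.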