M2S-AVSR: Modality-aware Multi-view Self-supervised Representation for Robust Audio-Visual Speech Recognition

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the degraded modality quality and increased audio-visual asynchrony that viewpoint variation, audio distortion, and visual occlusion cause in real-world scenarios. The authors propose M2S-AVSR, a modality-aware, multi-view self-supervised representation learning framework whose multi-view encoder learns view-invariant visual speech representations. A modality-aware module explicitly models modality quality and cross-modal synchrony, enabling fine-grained fusion. The study also presents AISHELL8-RealScene, a public multi-scenario, multi-view conversational Chinese audio-visual dataset recorded in real-world environments. The proposed method achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation on LRS3, sets a new state of the art on the MISP2021-AVSR test set, and achieves the best result in the outdoor scenes of AISHELL8-RealScene.
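For readers unfamiliar with the metric, the sketch below shows how a relative improvement is conventionally computed from two error rates. It assumes the figure refers to an error rate such as WER or CER, which the summary does not state. The function name and the input numbers are hypothetical illustrations and are not results from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction, in percent, of new_error over baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error * 100.0


# Hypothetical error rates chosen only to illustrate the arithmetic;
# they are NOT taken from the paper.
example = relative_improvement(baseline_error=10.0, new_error=7.06)
print(f"{example:.1f}% relative improvement")
```

A relative figure scales with the baseline, so the same percentage can correspond to very different absolute gains depending on how high the baseline error rate is.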

📝 Abstract

Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, but real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.
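The idea of weighting visual information by modality quality and cross-modal synchrony can be illustrated with a toy gating function. This is a minimal sketch under stated assumptions, not the authors' architecture: the abstract does not specify how the module is built, so the function `fuse_frame`, the scalar quality scores, and the `sync_score` in [0, 1] are all hypothetical, and in a real system such scores would presumably be learned rather than supplied by hand.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_frame(
    audio_feat: List[float],
    visual_feat: List[float],
    audio_quality: float,
    visual_quality: float,
    sync_score: float,
) -> List[float]:
    """Blend one frame of audio and visual features (hypothetical illustration).

    The visual stream's share grows when its quality score exceeds the audio
    score, and is scaled down when the streams are poorly synchronized.
    """
    if len(audio_feat) != len(visual_feat):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= sync_score <= 1.0:
        raise ValueError("sync_score must lie in [0, 1]")
    # Relative trust in the visual stream, in (0, 1).
    visual_share = sigmoid(visual_quality - audio_quality)
    # Poor synchrony suppresses visual injection entirely at sync_score = 0.
    gate = visual_share * sync_score
    return [(1.0 - gate) * a + gate * v for a, v in zip(audio_feat, visual_feat)]


# Equal quality and perfect synchrony give an even blend.
print(fuse_frame([1.0, 2.0], [3.0, 4.0], 0.0, 0.0, 1.0))
# Fully asynchronous streams fall back to audio only.
print(fuse_frame([1.0, 2.0], [3.0, 4.0], 0.0, 0.0, 0.0))
```

The design choice worth noting is that the gate is multiplicative: a high-quality but badly synchronized visual stream contributes little, which mirrors the abstract's point that quality and synchrony are modeled together rather than separately.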

Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Speech Recognition

modality quality

audio-visual asynchrony

viewpoint variation

visual occlusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view representation

modality-aware fusion

self-supervised learning