Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement

πŸ“… 2026-03-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of existing large language model (LLM)-based audio-visual speech recognition approaches, which suffer from shallow modality alignment and computational redundancy, thereby failing to effectively exploit complementary cross-modal information. To overcome these challenges, the authors propose the AVUR-LLM framework, which introduces sparse modality alignment to reduce redundancy and incorporates a visual semantic unit–guided fine-grained fusion mechanism to enhance the depth and robustness of cross-modal interaction. Evaluated on the LRS3 dataset, the proposed method achieves state-of-the-art performance in audio-visual speech recognition, yielding a 37% relative reduction in word error rate compared to the baseline system under 0 dB signal-to-noise ratio conditions.

πŸ“ Abstract
Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches project audio and visual features independently or apply shallow fusion, limiting cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based audio-visual speech recognition framework built on Sparse Modality Alignment and Visual Unit-Guided Refinement. Experiments on LRS3 demonstrate state-of-the-art results for AVSR. Under additive-noise conditions at 0 dB SNR, it achieves a 37% relative improvement over the baseline system.
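To make the headline number concrete, the "37% relative improvement" is a relative word-error-rate (WER) reduction. A minimal sketch of the computation follows; the absolute WER values used are hypothetical placeholders for illustration, not figures taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction of a system versus a baseline.

    E.g. baseline 10.0% WER -> system 6.3% WER is a 0.37 (37%) relative
    reduction, matching the improvement reported at 0 dB SNR.
    """
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical example values (not from the paper):
print(round(relative_wer_reduction(10.0, 6.3), 2))  # -> 0.37
```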
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Speech Recognition
Large Language Models
Cross-modal Alignment
Multimodal Fusion
Robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Modality Alignment
Visual Unit-Guided Refinement
LLM-based AVSR
Cross-modal Alignment
Audio-Visual Speech Recognition
πŸ”Ž Similar Papers
No similar papers found.
Fei Su
School of Computer Science, Wuhan University, China; School of Artificial Intelligence, Wuhan University, China
Cancan Li
School of Computer Science, Wuhan University, China; School of Artificial Intelligence, Wuhan University, China
Juan Liu
Wuhan University
Data Mining, Artificial Intelligence in Bioinformatics, Biomedicine
Wei Ju
AI Center, OPPO, China
Hongbin Suo
AI Center, OPPO, China
Ming Li
Professor, Duke Kunshan University
Speech Processing, Audio Processing, Affective Computing, Behavior Signal Processing