Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement

πŸ“… 2026-03-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of existing large language model (LLM)-based audio-visual speech recognition approaches, which suffer from shallow modality alignment and computational redundancy, thereby failing to effectively exploit complementary cross-modal information. To overcome these challenges, the authors propose the AVUR-LLM framework, which introduces sparse modality alignment to reduce redundancy and incorporates a visual semantic unit–guided fine-grained fusion mechanism to enhance the depth and robustness of cross-modal interaction. Evaluated on the LRS3 dataset, the proposed method achieves state-of-the-art performance in audio-visual speech recognition, yielding a 37% relative reduction in word error rate compared to the baseline system under 0 dB signal-to-noise ratio conditions.

πŸ“ Abstract
Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches project audio and visual features independently or apply shallow fusion, limiting cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based audio-visual speech recognition framework built on Sparse Modality Alignment and Visual Unit-Guided Refinement. Experiments on LRS3 demonstrate state-of-the-art results for AVSR. Under additive-noise conditions at 0 dB SNR, it achieves a 37% relative improvement over the baseline system.
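To make the headline number concrete, the "37% relative improvement" is a relative word-error-rate (WER) reduction. A minimal sketch of the computation follows; the absolute WER values used are hypothetical placeholders for illustration, not figures taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction of a system versus a baseline.

    E.g. baseline 10.0% WER -> system 6.3% WER is a 0.37 (37%) relative
    reduction, matching the improvement reported at 0 dB SNR.
    """
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical example values (not from the paper):
print(round(relative_wer_reduction(10.0, 6.3), 2))  # -> 0.37
```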
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Speech Recognition
Large Language Models
Cross-modal Alignment
Multimodal Fusion
Robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Modality Alignment
Visual Unit-Guided Refinement
LLM-based AVSR
Cross-modal Alignment
Audio-Visual Speech Recognition
πŸ”Ž Similar Papers
No similar papers found.
Fei Su
School of Computer Science, Wuhan University, China; School of Artificial Intelligence, Wuhan University, China
Cancan Li
School of Computer Science, Wuhan University, China; School of Artificial Intelligence, Wuhan University, China
Juan Liu
Wuhan University
Data Mining, Artificial Intelligence in Bioinformatics, Biomedicine
Wei Ju
AI Center, OPPO, China
Hongbin Suo
AI Center, OPPO, China
Ming Li
Professor, Duke Kunshan University
Speech Processing, Audio Processing, Affective Computing, Behavior Signal Processing