Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition

📅 2025-01-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language model (LLM)-based generative error correction for speech recognition has so far operated on audio-only hypotheses and lacks a multimodal, collaborative correction mechanism for audio-visual speech recognition (AVSR). To address this, we propose AVGER, a generative audio-visual joint error correction paradigm. First, an AVSR model generates an N-best hypothesis list; a Q-former-based multimodal synchronous encoder then compresses the synchronized audio and visual streams into cross-modal representations that an LLM can consume. A cross-modal prompt, combining the candidate hypotheses with these audio-visual representations, is fed to the LLM to produce the final transcription. Key innovations include a "listen–watch–rerecognize" three-stage correction pipeline and a multi-level consistency-constrained training criterion (operating at the logits, utterance, and representation levels) that improves correction accuracy and the interpretability of the compressed representations. On LRS3, AVGER reduces word error rate (WER) by 24% relative to mainstream AVSR baselines.
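The three-stage flow described above can be sketched as plain Python. All component names here (`avsr_nbest`, `qformer_compress`, `build_cross_modal_prompt`, `llm_correct`) are illustrative stubs standing in for the paper's trained models, not the actual implementation:

```python
# Hypothetical sketch of the AVGER "listen-watch-rerecognize" pipeline.
# Each stage is a stub; in the paper these are trained neural components.

def avsr_nbest(audio, video, n=5):
    """Stage 1 (listen/watch): an AVSR front end returns N-best hypotheses."""
    # Stub: pretend the recognizer produced n ranked candidate transcripts.
    return [f"hypothesis {i} for ({audio}, {video})" for i in range(1, n + 1)]

def qformer_compress(audio, video):
    """Stage 2: a Q-former-style encoder compresses the synchronized
    audio-visual streams into a small, fixed number of LLM-readable tokens."""
    # Stub: represent the compressed cross-modal features as token placeholders.
    return [f"<av_{i}>" for i in range(4)]

def build_cross_modal_prompt(av_tokens, hypotheses):
    """Combine the compressed audio-visual features with the N-best list."""
    lines = list(av_tokens)
    lines += [f"candidate {k}: {h}" for k, h in enumerate(hypotheses, 1)]
    lines.append("Produce the best transcription:")
    return "\n".join(lines)

def llm_correct(prompt, hypotheses):
    """Stage 3 (re-recognize): an LLM generates the corrected transcript.
    Stub: simply return the top-ranked candidate."""
    return hypotheses[0]
```

In the real system the LLM can also generate a transcription that appears in none of the candidates; the stub only illustrates the data flow from recognizer to prompt to corrector.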

📝 Abstract
Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be effectively used for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs lack the ability to understand audio and visual signals simultaneously, making the GER approach challenging to apply in AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again". Specifically, we first use a powerful AVSR system to read the audio and visual signals and produce the N-best hypotheses, and then use a Q-former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert them into audio and video compression representations that can be understood by the LLM. Afterward, the audio-visual compression representations and the N-best hypotheses together constitute a Cross-modal Prompt that guides the LLM in producing the best transcription. In addition, we propose a Multi-Level Consistency Constraint training criterion, covering the logits, utterance, and representation levels, to improve correction accuracy while enhancing the interpretability of the audio and visual compression representations. Experimental results on the LRS3 dataset show that our method outperforms current mainstream AVSR systems, reducing the Word Error Rate (WER) by 24% compared to them. Code and models can be found at: https://github.com/CircleRedRain/AVGER.
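The abstract names three constraint levels but not their exact loss forms; the sketch below shows one plausible instantiation, with the specific losses (KL divergence at the logits level, a word-mismatch rate at the utterance level, cosine distance at the representation level) and the weights all being assumptions for illustration:

```python
import math

def logits_level(pred_probs, target_probs):
    """Assumed logits-level constraint: KL-style divergence between
    predicted and reference token probability distributions."""
    return sum(t * math.log(t / max(p, 1e-9))
               for p, t in zip(pred_probs, target_probs) if t > 0)

def utterance_level(hyp, ref):
    """Assumed utterance-level constraint: word-level mismatch rate
    between the corrected hypothesis and the reference transcript."""
    hw, rw = hyp.split(), ref.split()
    mismatches = sum(a != b for a, b in zip(hw, rw)) + abs(len(hw) - len(rw))
    return mismatches / max(len(rw), 1)

def representation_level(av_repr, text_repr):
    """Assumed representation-level constraint: one minus cosine similarity
    between the audio-visual and text representations."""
    dot = sum(a * b for a, b in zip(av_repr, text_repr))
    na = math.sqrt(sum(a * a for a in av_repr))
    nb = math.sqrt(sum(b * b for b in text_repr))
    return 1.0 - dot / (na * nb)

def multi_level_loss(pred_probs, target_probs, hyp, ref, av_repr, text_repr,
                     weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three constraints; weights are illustrative."""
    w1, w2, w3 = weights
    return (w1 * logits_level(pred_probs, target_probs)
            + w2 * utterance_level(hyp, ref)
            + w3 * representation_level(av_repr, text_repr))
```

Tying the three levels together in one objective is what lets the correction accuracy and the interpretability of the compressed audio-visual representations be trained jointly.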
Problem

Research questions and friction points this paper is trying to address.

Speech Recognition
Audio-Visual Integration
Error Correction
Innovation

Methods, ideas, or system contributions that make the work stand out.

AVGER
Multimodal Information Fusion
Error Reduction in Speech Recognition