🤖 AI Summary
Lip reading (visual automatic speech recognition) is highly challenging due to the absence of auditory cues and the inherent visual ambiguity of phonemes. To address this, we propose a phoneme-centric two-stage decoupled framework: Stage I employs a Video Transformer with Connectionist Temporal Classification (CTC) loss to robustly predict compact phoneme sequences; Stage II leverages a fine-tuned large language model (LLM) to map these phoneme sequences into semantically coherent words and sentences. This paradigm explicitly models intermediate linguistic structure, enhancing robustness while significantly improving data efficiency. Our method achieves state-of-the-art performance on the LRS2 and LRS3 benchmarks, reducing the word error rate (WER) on the LRS3 test set to 18.7%. Remarkably, it surpasses prior best methods using only 0.6% of their labeled training data, marking the first demonstration of high-accuracy, speaker-independent lip reading under extremely low-resource conditions.
📝 Abstract
Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity of phonemes with overlapping visemes, i.e., distinct phonemes that appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently incurs high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly (often faltering on visually similar phonemes) or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, achieving significant reductions in Word Error Rate (WER), including a state-of-the-art WER of 18.7% on LRS3, despite using 99.4% less labelled data than the next best approach.
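To make the two-stage decoding flow concrete, here is a minimal, illustrative sketch. Stage I's CTC head emits a per-frame phoneme label stream that greedy CTC decoding collapses into a compact phoneme sequence; Stage II maps that sequence to words. The tiny phoneme-to-word lookup table below is a hypothetical stand-in for the fine-tuned LLM, not the paper's actual model, and the phoneme labels and `<b>` blank symbol are illustrative assumptions.

```python
# Illustrative sketch of the two-stage pipeline (not the paper's implementation).
# Stage I: per-frame CTC phoneme labels -> greedy collapse (merge repeats, drop blanks).
# Stage II: phoneme sequence -> words; a toy lookup stands in for the fine-tuned LLM.

BLANK = "<b>"  # CTC blank symbol (assumed notation)

def ctc_greedy_collapse(frame_labels):
    """Merge consecutive repeated labels, then drop CTC blanks."""
    phonemes = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            phonemes.append(label)
        prev = label
    return phonemes

# Hypothetical Stage II: a lexicon lookup standing in for the LLM decoder.
PHONEME_LEXICON = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def phonemes_to_word(phonemes):
    return PHONEME_LEXICON.get(tuple(phonemes), "<unk>")

# Typical CTC frame output: repeated labels and blanks between phonemes.
frames = ["HH", "HH", BLANK, "AH", "AH", "L", BLANK, "OW", "OW"]
phonemes = ctc_greedy_collapse(frames)
print(phonemes)                  # ['HH', 'AH', 'L', 'OW']
print(phonemes_to_word(phonemes))  # hello
```

Note that a blank between two identical labels preserves a genuinely repeated phoneme (e.g. `["L", "<b>", "L"]` collapses to `["L", "L"]`), which is exactly why CTC introduces the blank symbol.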