🤖 AI Summary
Lip reading (visual automatic speech recognition) is highly challenging due to the absence of auditory cues and the inherent visual ambiguity of phonemes. To address this, we propose a phoneme-centric two-stage decoupled framework: Stage I employs a Video Transformer with Connectionist Temporal Classification (CTC) loss to robustly predict compact phoneme sequences; Stage II leverages a fine-tuned large language model (LLM) to map these phoneme sequences into semantically coherent words and sentences. This paradigm explicitly models intermediate linguistic structure, enhancing robustness while significantly improving data efficiency. Our method achieves state-of-the-art performance on the LRS2 and LRS3 benchmarks, reducing the word error rate (WER) on the LRS3 test set to 18.7%. Remarkably, it surpasses prior best methods using only 0.6% of their labeled training data, marking the first demonstration of high-accuracy, speaker-independent lip reading under extremely low-resource conditions.
📝 Abstract
Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity of phonemes with overlapping visemes, i.e., distinct phonemes that appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently incurs high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly (often faltering on visually similar phonemes) or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, achieving significant reductions in Word Error Rate (WER), including a state-of-the-art WER of 18.7% on LRS3, despite using 99.4% less labelled data than the next best approach.
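To make the two-stage decoding flow concrete, here is a minimal, illustrative sketch. Stage I's CTC head emits a per-frame phoneme label stream that greedy CTC decoding collapses into a compact phoneme sequence; Stage II maps that sequence to words. The tiny phoneme-to-word lookup table below is a hypothetical stand-in for the fine-tuned LLM, not the paper's actual model, and the phoneme labels and `<b>` blank symbol are illustrative assumptions.

```python
# Illustrative sketch of the two-stage pipeline (not the paper's implementation).
# Stage I: per-frame CTC phoneme labels -> greedy collapse (merge repeats, drop blanks).
# Stage II: phoneme sequence -> words; a toy lookup stands in for the fine-tuned LLM.

BLANK = "<b>"  # CTC blank symbol (assumed notation)

def ctc_greedy_collapse(frame_labels):
    """Merge consecutive repeated labels, then drop CTC blanks."""
    phonemes = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            phonemes.append(label)
        prev = label
    return phonemes

# Hypothetical Stage II: a lexicon lookup standing in for the LLM decoder.
PHONEME_LEXICON = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def phonemes_to_word(phonemes):
    return PHONEME_LEXICON.get(tuple(phonemes), "<unk>")

# Typical CTC frame output: repeated labels and blanks between phonemes.
frames = ["HH", "HH", BLANK, "AH", "AH", "L", BLANK, "OW", "OW"]
phonemes = ctc_greedy_collapse(frames)
print(phonemes)                  # ['HH', 'AH', 'L', 'OW']
print(phonemes_to_word(phonemes))  # hello
```

Note that a blank between two identical labels preserves a genuinely repeated phoneme (e.g. `["L", "<b>", "L"]` collapses to `["L", "L"]`), which is exactly why CTC introduces the blank symbol.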