🤖 AI Summary
This study addresses the challenges of uneven information distribution and dynamic temporal variations in silent-video lip reading. To this end, we propose a non-uniform sequence modeling framework. Our method introduces: (1) a non-uniform quantization module that enables fine-grained temporal alignment between encoder and decoder while adaptively focusing on salient frames; and (2) a lightweight temporal convolutional architecture integrated with customized data augmentation and dynamic frame sampling, enhancing robustness against illumination changes, head-pose variations, and other visual distortions. Evaluated on the LRW and LRW1000 benchmarks, our approach achieves state-of-the-art Top-1 accuracies of 92.0% and 60.7%, respectively. These results advance the practical deployment of visual speech recognition in assistive technology and augmented reality applications.
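To make the non-uniform focus idea concrete, here is a minimal sketch of one way such a module could allocate a fixed number of decoder slots across a frame sequence in proportion to per-frame saliency, via inverse-CDF sampling. The function name, the saliency-score input, and the sampling scheme are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def nonuniform_sample(num_frames, scores, num_slots):
    """Map `num_slots` decoder positions onto frames non-uniformly.

    Hypothetical sketch: `scores` is a per-frame saliency estimate
    (e.g. from an attention head); salient regions of the sequence
    occupy more of the cumulative distribution, so they receive more
    of the evenly spaced quantile probes.
    """
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()                       # normalize to a distribution
    cdf = np.cumsum(p)
    quantiles = (np.arange(num_slots) + 0.5) / num_slots
    idx = np.searchsorted(cdf, quantiles)  # invert the CDF
    return np.clip(idx, 0, num_frames - 1)

# Uniform saliency degenerates to (near-)uniform frame selection,
# while a saliency peak pulls most slots toward the salient frames.
uniform_idx = nonuniform_sample(10, np.ones(10), 5)
peaked_idx = nonuniform_sample(10, [1, 1, 1, 1, 10, 10, 1, 1, 1, 1], 5)
```

Under this toy scheme, `peaked_idx` clusters around frames 4 and 5, which is the qualitative behavior the summary describes: the network spends its fixed capacity where the visual information actually is.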
📝 Abstract
Estimating spoken content from silent videos is crucial for applications in Assistive Technology (AT) and Augmented Reality (AR). However, accurately mapping lip-movement sequences in videos to words poses significant challenges due to variability across sequences and the uneven distribution of information within each sequence. To tackle this, we introduce InfoSyncNet, a non-uniform sequence modeling network enhanced by tailored data augmentation techniques. Central to InfoSyncNet is a non-uniform quantization module positioned between the encoder and decoder, enabling dynamic adjustment of the network's focus and effectively handling the natural inconsistencies in visual speech data. Additionally, multiple training strategies are incorporated to enhance the model's robustness to variations in lighting and the speaker's orientation. Comprehensive experiments on the LRW and LRW1000 datasets confirm the superiority of InfoSyncNet, achieving new state-of-the-art Top-1 accuracies of 92.0% and 60.7%. The code is available for download (see comments).
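The abstract's robustness training strategies could, for instance, take the form of clip-level augmentations applied consistently across all frames of a video. The sketch below shows two such augmentations under stated assumptions (grayscale lip crops normalized to [0, 1]); the parameter ranges and function name are illustrative, not the paper's actual recipe.

```python
import numpy as np

def augment_clip(frames, rng):
    """Apply one augmentation draw to a whole clip (illustrative sketch).

    frames: array of shape (T, H, W), grayscale lip crops in [0, 1].
    rng:    a numpy Generator, so draws are reproducible.
    """
    out = np.asarray(frames, dtype=float).copy()
    # Brightness/contrast jitter approximates illumination changes.
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-0.1, 0.1)
    out = np.clip(out * contrast + brightness, 0.0, 1.0)
    # Horizontal flip approximates a mirrored head orientation; it is
    # applied to every frame at once so temporal consistency holds.
    if rng.random() < 0.5:
        out = out[:, :, ::-1]
    return out

rng = np.random.default_rng(0)
clip = rng.random((29, 88, 88))   # e.g. a 29-frame LRW-style clip
augmented = augment_clip(clip, rng)
```

Sampling the jitter once per clip, rather than per frame, is the key design point: per-frame jitter would inject artificial temporal variation into exactly the signal the model is trying to read.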