🤖 AI Summary
This study addresses the limited robustness of end-to-end Spanish continuous visual speech recognition (lip-reading) under data scarcity, acoustic noise, and cross-speaker variability. We propose the first end-to-end Spanish lip-reading system, built on a Transformer architecture with a temporal modeling strategy adapted to low-resource conditions. Our approach integrates joint CTC–attention decoding, visual feature enhancement, and synthetic data augmentation. We also conduct the first systematic evaluation of Spanish lip-reading generalization across diverse realistic conditions, including visual ambiguity, inter-speaker articulatory variation, and silent frames. On the Spanish-LRS benchmark, our model achieves a word error rate (WER) of 38.2%, a 9.7-point absolute reduction over the baseline. Notably, it retains its ability to model the structure of the speech stream even in few-shot settings, demonstrating strong adaptability to challenging visual conditions.
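Two of the quantities mentioned above are easy to make concrete. WER is the word-level edit distance between a reference and a hypothesis transcript, normalized by the reference length, and joint CTC–attention decoding typically scores a hypothesis as a weighted interpolation of the two decoders' log-probabilities. The sketch below illustrates both; the interpolation weight `lam` and the function names are illustrative assumptions, not details taken from the paper.

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # match or substitute
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)


def hybrid_score(ctc_log_prob: float, att_log_prob: float, lam: float = 0.3) -> float:
    """Joint CTC-attention score for one hypothesis.

    lam is the CTC weight (0.3 here is a common choice in hybrid systems,
    assumed for illustration; the summary does not state the paper's value).
    """
    return lam * ctc_log_prob + (1.0 - lam) * att_log_prob
```

A WER of 38.2% thus means that, on average, roughly 38 word-level edits (substitutions, insertions, deletions) are needed per 100 reference words to match the model's output.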