🤖 AI Summary
Current audio-visual speech recognition (AVSR) models achieve strong performance on standard benchmarks, yet their generalization remains questionable because of potential adaptive overfitting. This work constructs an unseen evaluation subset that strictly matches the LRS3 test set in acoustic, visual, and demographic distributions, enabling a systematic assessment of mainstream AVSR models under distribution-matched conditions. Using controlled sampling from the large-scale MultiVSR dataset and a fine-grained attribute analysis across seven factors, it shows that five state-of-the-art AVSR architectures all suffer significant performance degradation, sometimes even underperforming audio-only baselines. The study further identifies the attribute factors and lexical biases that drive this collapse. The newly curated matched test set is publicly released to advance research on robust AVSR systems.
📝 Abstract
Current Audio-Visual Speech Recognition (AVSR) models achieve near-perfect performance on the standard LRS3 benchmark, raising concerns of adaptive overfitting. To systematically assess true generalisability, we construct a highly controlled, unseen evaluation set subsampled from the massive MultiVSR dataset. Unlike standard out-of-distribution benchmarks, our subset strictly matches the acoustic, visual, and demographic distributions of the LRS3 test set. Evaluating five state-of-the-art architectures reveals a universal performance collapse, showing that current systems fail to generalise even under strictly aligned conditions. Through a fine-grained attribute analysis across seven factors, we isolate the specific drivers of this degradation. Furthermore, we uncover a profound lexical bias, expose distinct error patterns, and find, surprisingly, that audio-visual performance lags behind audio-only performance. We release our matched test set for future benchmarking.
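The idea of subsampling a large pool so that it matches a reference test set can be illustrated with a minimal sketch. This is not the authors' procedure, which the abstract does not describe; it assumes simple stratified sampling over discretised attributes, and the function name `matched_subsample` and the attribute keys `gender` and `snr_bin` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def matched_subsample(reference, pool, keys, n, seed=0):
    """Draw about n items from `pool` so that the joint distribution of the
    attributes in `keys` follows that of `reference` (stratified sampling).

    Illustrative assumption: every item is a dict of discrete attribute values.
    """
    rng = random.Random(seed)

    def stratum(item):
        # A stratum is one combination of attribute values.
        return tuple(item[k] for k in keys)

    ref_counts = Counter(stratum(r) for r in reference)
    by_stratum = defaultdict(list)
    for item in pool:
        by_stratum[stratum(item)].append(item)

    total = sum(ref_counts.values())
    sample = []
    for s, count in sorted(ref_counts.items()):
        quota = round(n * count / total)      # share of this stratum in the reference
        candidates = by_stratum.get(s, [])
        take = min(quota, len(candidates))    # cannot draw more than the pool holds
        sample.extend(rng.sample(candidates, take))
    return sample


# Hypothetical attributes: speaker gender and a binned signal-to-noise ratio.
reference = [{"gender": "f", "snr_bin": "high"}] * 3 + [{"gender": "m", "snr_bin": "low"}]
pool = ([{"gender": "f", "snr_bin": "high", "id": i} for i in range(10)]
        + [{"gender": "m", "snr_bin": "low", "id": i} for i in range(10, 20)])
subset = matched_subsample(reference, pool, keys=("gender", "snr_bin"), n=8)
```

When a stratum is under-represented in the pool, this sketch returns fewer than `n` items rather than distorting the proportions; a real pipeline would also need to bin continuous attributes and exclude any speakers or clips seen in training.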