🤖 AI Summary
This work investigates how audio, lip motion, and scene-level visual context interact in multimodal speech recognition, and characterizes their complementary gains under varying noise conditions and data distributions. We propose a decoder-only discrete-token language model framework for multimodal fusion, introducing two key innovations: cross-modal feature alignment and vision-guided pre-filtering of visual information. Evaluation is conducted on both synthetic and real-world datasets. To our knowledge, this is the first empirical study to demonstrate that jointly leveraging audio, lip motion, and scene-level visual context significantly improves recognition accuracy. Notably, the image modality contributes most under moderate acoustic noise, a trend distinct from that of inherently synchronized modalities such as lip motion, which highlights its unique role as an asynchronous contextual cue. The results indicate that optimal modality selection must adapt to the noise level and data characteristics, a principle for building robust multimodal speech recognition systems.
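For concreteness, the sketch below shows one way such a decoder-only discrete-token fusion could be wired: tokens from each modality are embedded into a shared space and concatenated into a single causal sequence, with next-token prediction over the text vocabulary. Everything here (vocabulary sizes, segment order, model dimensions, and the `MultimodalDecoderLM` name) is an illustrative assumption, not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation) of fusing discrete audio,
# lip-motion, and image-context tokens in a decoder-only language model.
import torch
import torch.nn as nn

# Hypothetical per-modality codebook sizes.
VOCAB = {"audio": 1024, "lip": 512, "image": 512, "text": 32000}

class MultimodalDecoderLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        # One embedding table per modality, all projected into a shared space.
        self.embed = nn.ModuleDict(
            {m: nn.Embedding(v, d_model) for m, v in VOCAB.items()}
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB["text"])

    def forward(self, audio_ids, lip_ids, image_ids, text_ids):
        # Embed each discrete-token stream and concatenate along the time axis:
        # [image context][lip][audio][text so far]. A real system would also
        # insert separator/BOS tokens between segments.
        x = torch.cat(
            [
                self.embed["image"](image_ids),
                self.embed["lip"](lip_ids),
                self.embed["audio"](audio_ids),
                self.embed["text"](text_ids),
            ],
            dim=1,
        )
        T = x.size(1)
        # Causal mask: True above the diagonal blocks attention to the future.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
        h = self.decoder(x, mask=causal)
        # Logits over the text vocabulary; in training, the loss would be
        # applied only to the text positions.
        return self.head(h)

# Toy usage: batch of 1, short token streams per modality.
model = MultimodalDecoderLM()
logits = model(
    torch.randint(1024, (1, 20)),   # audio tokens
    torch.randint(512, (1, 10)),    # lip tokens
    torch.randint(512, (1, 4)),     # image-context tokens
    torch.randint(32000, (1, 6)),   # text prefix
)
print(logits.shape)  # torch.Size([1, 40, 32000])
```

In a full system, the image-context tokens would come from the vision-guided pre-filtering stage described above, so that only relevant visual content enters the sequence.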
📝 Abstract
Decoder-only discrete-token language models have recently achieved significant success in automatic speech recognition. However, systematic analyses of how different modalities affect performance in specific scenarios remain limited. In this paper, we investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets. Our experiments suggest that: (1) Integrating more modalities can increase accuracy; in particular, our paper is, to the best of our knowledge, the first to show the benefit of combining audio, image context, and lip information; (2) Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels; moreover, they exhibit a trend different from that of inherently synchronized modalities such as lip movements; (3) Performance improves on both synthetic and real-world datasets when a preprocessing step filters the visual input down to its most relevant content.
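Finding (3) implies a relevance filter ahead of the recognizer. The abstract does not specify the mechanism, so the following is a minimal sketch, assuming precomputed embeddings for candidate visual regions, cosine similarity as the relevance score, and a top-k cutoff; the function name, the query source, and the value of `k` are all hypothetical.

```python
# Minimal sketch of relevance-based visual pre-filtering: keep only the k
# candidate regions whose embeddings are most similar to a query context
# (e.g. a first-pass transcript embedding). Embedding model and k are
# assumptions; the paper's exact filter may differ.
import torch
import torch.nn.functional as F

def filter_visual_context(
    region_embs: torch.Tensor,  # (N, D) candidate image/region embeddings
    query_emb: torch.Tensor,    # (D,) context embedding to match against
    k: int = 4,
) -> torch.Tensor:
    # Cosine similarity between every region and the query context.
    sims = F.cosine_similarity(region_embs, query_emb.unsqueeze(0), dim=-1)
    topk = sims.topk(min(k, region_embs.size(0))).indices
    # Only these regions would be tokenized and fed to the language model.
    return region_embs[topk]

# Example: 16 random region embeddings, keep the 4 closest to the query.
regions = torch.randn(16, 256)
query = torch.randn(256)
kept = filter_visual_context(regions, query)
print(kept.shape)  # torch.Size([4, 256])
```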