Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?

📅 2024-09-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the synergistic mechanisms among audio, lip motion, and visual context modalities in multimodal speech recognition, and characterizes their complementary gains under varying noise conditions and data distributions. We propose a decoder-only discrete-token language model framework for multimodal fusion, introducing two key innovations: cross-modal feature alignment and vision-guided pre-filtering of visual information. Evaluation is conducted on both synthetic and real-world datasets. Our empirical study is the first to demonstrate that jointly leveraging audio, lip motion, and scene-level visual context significantly improves recognition accuracy. Notably, the image modality contributes most under moderate acoustic noise—distinct from synchronous modalities like lip motion—highlighting its unique role as an asynchronous contextual cue. Results indicate that optimal modality selection must be adaptive to noise level and data characteristics, establishing a new paradigm for robust multimodal speech recognition.
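The decoder-only discrete-token fusion described above can be sketched as simple token-sequence concatenation: each modality is tokenized separately and the streams are joined into one input that the language model attends over before generating the transcript. This is a minimal illustrative sketch; the marker tokens, ordering, and token IDs are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of decoder-only multimodal fusion via token concatenation.
# The special marker tokens and sequence ordering are illustrative assumptions.

AUDIO_BOS, LIP_BOS, IMG_BOS, TEXT_BOS = "<audio>", "<lip>", "<img>", "<text>"

def build_decoder_input(audio_tokens, lip_tokens, image_tokens):
    """Concatenate discrete tokens from each modality into a single sequence;
    the transcript is then decoded autoregressively after the <text> marker."""
    seq = [IMG_BOS] + image_tokens      # asynchronous scene-level visual context
    seq += [AUDIO_BOS] + audio_tokens   # synchronous audio stream
    seq += [LIP_BOS] + lip_tokens       # synchronous lip-motion stream
    seq.append(TEXT_BOS)                # transcript generation starts here
    return seq

seq = build_decoder_input([101, 102], [201], [301, 302, 303])
```

In such a setup, the vision-guided pre-filtering step would reduce `image_tokens` to only the most transcript-relevant visual content before the sequence is built.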

📝 Abstract
Decoder-only discrete-token language models have recently achieved significant success in automatic speech recognition. However, systematic analyses of how different modalities impact performance in specific scenarios remain limited. In this paper, we investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets. Our experiments suggest that: (1) Integrating more modalities can increase accuracy; in particular, our paper is, to the best of our knowledge, the first to show the benefit of combining audio, image context, and lip information; (2) Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels; moreover, they exhibit a different trend compared to inherently synchronized modalities like lip movements; (3) Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.
Problem

Research questions and friction points this paper is trying to address.

Analyzes the impact of multiple modalities on speech recognition accuracy
Explores the benefits of combining audio, image context, and lip movements
Evaluates modality effectiveness under varying noise levels and in real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses audio, image context, and lip information in a decoder-only discrete-token framework
Shows images contribute most at moderate acoustic noise levels
Filters the most relevant visual information as a preprocessing step
Yiwen Guan
Worcester Polytechnic Institute
V. Trinh
Worcester Polytechnic Institute
Vivek Voleti
Worcester Polytechnic Institute
Jacob Whitehill
Worcester Polytechnic Institute
Artificial Intelligence