🤖 AI Summary
This work addresses the limitations of current automatic speech recognition (ASR) systems, which are typically confined to monolingual settings and isolated short utterances and therefore struggle to leverage multilingual capabilities and contextual information. The authors propose a context-aware multilingual ASR framework that connects a frozen speech encoder to a decoder-only language model via a lightweight projection module, incorporating structured prompts, such as dialogue history and biasing terms, to guide transcription. Notably, this is the first approach to integrate structured contextual prompting with contrastive learning in multilingual ASR, enabling cross-modal alignment between speech and contextual representations. Evaluated on real-world conversational data spanning 11 languages and 5 English dialects (over 1,500 hours in total), the method achieves substantial improvements: both contextual inputs and contrastive alignment contribute consistent gains, yielding an overall performance improvement of over 5%.
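To make the structured-prompt idea concrete, here is a minimal sketch of how dialogue history and biasing terms might be assembled into a text prompt for the decoder-only language model. The template, tag names, and function signature are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical prompt builder: combines dialogue history and biasing
# words into a single structured context string. The bracketed tags
# ([lang], [history], [bias], [transcribe]) are assumed placeholders.
def build_context_prompt(history, biasing_words, language):
    parts = [f"[lang: {language}]"]
    if history:
        # Prior conversation turns, oldest first, joined with separators.
        parts.append("[history] " + " | ".join(history))
    if biasing_words:
        # Rare or domain-specific terms the model should favor.
        parts.append("[bias] " + ", ".join(biasing_words))
    parts.append("[transcribe]")
    return " ".join(parts)

prompt = build_context_prompt(
    history=["How do I reset my password?", "Go to account settings."],
    biasing_words=["SSO", "OAuth"],
    language="en",
)
```

In a setup like the one described, this string would be tokenized and prepended to the projected speech embeddings before decoding.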
📝 Abstract
Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach connects a frozen speech encoder to a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve the interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains across different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
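The contrastive objective described above can be sketched as a symmetric InfoNCE-style loss over pooled speech and context embeddings. This is a generic illustration of the technique, assuming mean-pooled embeddings of matching dimension; the actual pooling, temperature, and architecture details are not specified here and are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, context_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    speech_emb, context_emb: (batch, dim) tensors, e.g. pooled outputs
    of the projection module and a context (text) encoder. Matched
    pairs share a row index; other rows act as in-batch negatives.
    """
    s = F.normalize(speech_emb, dim=-1)
    c = F.normalize(context_emb, dim=-1)
    logits = s @ c.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)
    # Pull matched pairs (diagonal) together, push mismatches apart,
    # in both speech-to-context and context-to-speech directions.
    loss_s2c = F.cross_entropy(logits, targets)
    loss_c2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2c + loss_c2s)

# Toy usage with random embeddings standing in for encoder outputs.
torch.manual_seed(0)
speech = torch.randn(8, 256)
context = torch.randn(8, 256)
loss = contrastive_alignment_loss(speech, context)
```

During training, this auxiliary loss would typically be added to the ASR decoding loss so that speech and context occupy a shared embedding space.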