🤖 AI Summary
This work addresses the high data requirements and scarcity of labeled speech data for Speech Large Language Models (Speech LLMs) in low-resource languages. To this end, we propose a lightweight cross-lingual modality alignment method built upon the SLAM-ASR framework. Specifically, we adopt whisper-large-v3-turbo as the speech encoder and integrate multilingual LLMs (EuroLLM and Salamandra) via trainable monolingual and multilingual projection layers that map speech embeddings into the textual semantic space. Our key contribution is leveraging projection modules pretrained on high-resource languages to alleviate data dependency in low-resource settings. Experiments on multiple low-resource ASR benchmarks demonstrate that our approach matches or approaches the performance of Whisper-only models using only a small amount of labeled data, significantly reducing annotation costs. These results validate the effectiveness and strong generalization of multilingually pretrained projection modules for few-shot speech recognition.
📝 Abstract
Large language models (LLMs) have demonstrated potential in handling spoken inputs for high-resource languages, reaching state-of-the-art performance in various tasks. However, their applicability is still less explored in low-resource settings. This work investigates the use of Speech LLMs for low-resource Automatic Speech Recognition using the SLAM-ASR framework, where a trainable lightweight projector connects a speech encoder and an LLM. First, we assess training data volume requirements to match Whisper-only performance, re-emphasizing the challenges of limited data. Second, we show that leveraging mono- or multilingual projectors pretrained on high-resource languages reduces the impact of data scarcity, especially with small training sets. Using multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo, we evaluate performance on several public benchmarks, providing insights for future research on optimizing Speech LLMs for low-resource languages and multilinguality.
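The core architectural idea described above, a trainable lightweight projector bridging a frozen speech encoder and a frozen LLM, can be sketched minimally in PyTorch. The design below follows the common SLAM-ASR recipe (frame concatenation for temporal downsampling followed by a small MLP), but the specific dimensions (1280 for the Whisper encoder, 2048 for the LLM embedding space), the downsampling factor `k`, and the hidden width are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Lightweight projector in the SLAM-ASR style: downsample encoder
    frames by concatenating k adjacent frames, then map them into the
    LLM's embedding space with a small MLP. All dims are assumptions."""

    def __init__(self, enc_dim=1280, llm_dim=2048, k=5, hidden=2048):
        super().__init__()
        self.k = k  # temporal downsampling factor
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * k, hidden),
            nn.ReLU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, x):
        # x: (batch, frames, enc_dim), e.g. Whisper encoder output
        b, t, d = x.shape
        t = t - t % self.k  # drop trailing frames not divisible by k
        x = x[:, :t].reshape(b, t // self.k, d * self.k)
        return self.mlp(x)  # (batch, t // k, llm_dim)

if __name__ == "__main__":
    proj = SpeechProjector()
    speech = torch.randn(2, 100, 1280)  # dummy encoder features
    soft_tokens = proj(speech)
    print(soft_tokens.shape)  # torch.Size([2, 20, 2048])
```

Only the projector's parameters would be trained; its outputs are prepended (as "soft" speech tokens) to the embedded text prompt before being fed to the LLM. The paper's cross-lingual trick is then simply to initialize these weights from a projector trained on a high-resource language before fine-tuning on the low-resource one.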