Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high data requirements and scarcity of labeled speech data for Speech Large Language Models (Speech LLMs) in low-resource languages. To this end, we propose a lightweight cross-lingual modality alignment method built upon the SLAM-ASR framework. Specifically, we adopt Whisper-large-v3-turbo as the speech encoder and integrate multilingual LLMs—including EuroLLM and Salamandra—via trainable monolingual and multilingual projection layers that map speech embeddings into textual semantic space. Our key contribution lies in leveraging projection modules pretrained on high-resource languages to alleviate data dependency in low-resource settings. Experiments on multiple low-resource ASR benchmarks demonstrate that our approach achieves performance comparable to or approaching that of Whisper-only models using only a small amount of labeled data—significantly reducing annotation costs. These results validate the effectiveness and strong generalization capability of multilingually pretrained projection modules for few-shot speech recognition.
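The core trainable component described above is a lightweight projector that maps speech-encoder frames into the LLM's textual embedding space. A minimal NumPy sketch of such a projector is shown below; the downsampling factor, hidden size, and embedding dimensions are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def project_speech_frames(frames, w1, b1, w2, b2, k=4):
    """Map speech-encoder frames into an LLM embedding space.

    frames: (T, d_enc) encoder outputs. k consecutive frames are
    concatenated to shorten the sequence, then passed through a
    two-layer MLP -- a common lightweight projector design.
    """
    T, d_enc = frames.shape
    T = (T // k) * k                             # drop leftover frames
    stacked = frames[:T].reshape(T // k, k * d_enc)
    hidden = np.maximum(stacked @ w1 + b1, 0.0)  # ReLU
    return hidden @ w2 + b2                      # (T // k, d_llm)

# Illustrative dimensions: Whisper-style 1280-dim frames projected
# into a hypothetical 2048-dim LLM embedding space.
rng = np.random.default_rng(0)
d_enc, d_llm, hid, k = 1280, 2048, 512, 4
w1 = rng.normal(scale=0.02, size=(k * d_enc, hid)); b1 = np.zeros(hid)
w2 = rng.normal(scale=0.02, size=(hid, d_llm));     b2 = np.zeros(d_llm)

frames = rng.normal(size=(100, d_enc))
emb = project_speech_frames(frames, w1, b1, w2, b2, k)
print(emb.shape)  # (25, 2048)
```

In the transfer setting the paper studies, weights like `w1`/`w2` would be pretrained on a high-resource language (with the encoder and LLM frozen) and reused as initialization when fine-tuning on scarce low-resource data.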

📝 Abstract
Large language models (LLMs) have demonstrated potential in handling spoken inputs for high-resource languages, reaching state-of-the-art performance in various tasks. However, their applicability is still less explored in low-resource settings. This work investigates the use of Speech LLMs for low-resource Automatic Speech Recognition using the SLAM-ASR framework, where a trainable lightweight projector connects a speech encoder and an LLM. Firstly, we assess training data volume requirements to match Whisper-only performance, re-emphasizing the challenges of limited data. Secondly, we show that leveraging mono- or multilingual projectors pretrained on high-resource languages reduces the impact of data scarcity, especially with small training sets. Using multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo, we evaluate performance on several public benchmarks, providing insights for future research on optimizing Speech LLMs for low-resource languages and multilinguality.
Problem

Research questions and friction points this paper is trying to address.

Assess data volume needs for Speech LLMs in low-resource ASR
Explore pretraining impact of high-resource languages on low-resource performance
Optimize multilingual Speech LLMs for limited training data scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight projector connects speech encoder and LLM
Pretrained projectors reduce low-resource data impact
Multilingual LLMs optimize performance for scarce data
Seraphina Fong
Department of Information Engineering and Computer Science, University of Trento, Italy
Marco Matassoni
Fondazione Bruno Kessler
speech recognition · signal processing
Alessio Brutti
FBK
audio/speech processing