Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

📅 2025-03-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Poor performance on low-resource languages remains a critical bottleneck for modern ASR systems. This paper proposes a two-stage enhancement framework that integrates statistical n-gram language models (LMs) with large language models (LLMs) to systematically improve Whisper's robustness in low-resource settings: first, beam-search decoding is constrained via shallow n-gram scoring; second, the candidate hypotheses are semantically rescored with an LLM. To our knowledge, this is the first empirical demonstration of complementary gains from jointly leveraging n-gram and LLM components with fine-tuned Whisper models, and it reveals key principles governing the co-optimization of model scale and LM parameters. Under a unified cross-dataset evaluation protocol, the method achieves up to a 51% relative reduction in word error rate on in-distribution test sets and up to a 34% improvement on out-of-distribution sentences. LLM-based rescoring delivers consistent, moderate gains across multilingual scenarios.
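
The first stage is essentially shallow fusion: each beam-search hypothesis from Whisper is re-scored with an n-gram LM before the best one is picked. Below is a minimal Python sketch of that step, assuming KenLM as the n-gram backend; the function name, the (text, acoustic log-probability) input format, and the alpha/beta weights are illustrative placeholders, not the authors' actual code.

```python
# Hedged sketch of stage 1: n-gram shallow fusion over an n-best list.
# Assumes `hypotheses` holds (text, acoustic_logprob) pairs from a
# beam-search Whisper decode (not shown) and that a KenLM model has
# been trained separately; alpha and beta would be tuned on a dev set.
import kenlm  # pip install kenlm


def fuse_with_ngram(hypotheses, lm_path, alpha=0.5, beta=1.0):
    """Re-rank hypotheses by acoustic + alpha * LM + beta * length."""
    lm = kenlm.Model(lm_path)
    scored = []
    for text, acoustic in hypotheses:
        lm_logprob = lm.score(text, bos=True, eos=True)  # log10 scale
        fused = acoustic + alpha * lm_logprob + beta * len(text.split())
        scored.append((fused, text))
    return [text for _, text in sorted(scored, reverse=True)]
```

The length bonus beta is there because the LM term otherwise biases the ranking toward shorter hypotheses; the summary's point about co-optimizing model scale and LM parameters is consistent with tuning alpha and beta per Whisper size.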

📝 Abstract
Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses that gap by integrating traditional and novel language models with fine-tuned Whisper models to improve their performance on less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data on which Whisper was pre-trained but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results with transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
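
As a concrete illustration of the LLM result mentioned above, here is a hedged sketch of rescoring an n-best list with a causal LLM's log-likelihood via Hugging Face transformers. The "gpt2" checkpoint, the invented n-best list, and the fusion weight are placeholders only; the paper evaluates its own multilingual setups.

```python
# Hedged sketch of stage 2: rescoring an n-best list with a causal LLM.
# "gpt2" is a stand-in checkpoint; the weight alpha is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def llm_logprob(text: str) -> float:
    """Approximate total token log-probability of `text` under the LLM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item() * ids.size(1)


# Invented n-best list: (hypothesis, acoustic log-prob from Whisper).
nbest = [("the whether is nice today", -4.1),
         ("the weather is nice today", -4.3)]
alpha = 0.6
best = max(nbest, key=lambda h: h[1] + alpha * llm_logprob(h[0]))
print(best)  # the LLM should prefer the well-formed second hypothesis
```
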
Problem

Research questions and friction points this paper is trying to address.

Improves ASR for low-resource languages using language models
Reduces word error rates in minority language scenarios
Enhances Whisper models with optimized linguistic adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned Whisper models with language models
Improved word error rate in low-resource languages (see the evaluation sketch after this list)
Combined statistical and large language models
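
The abstract's warning about evaluation parameters is easy to demonstrate: the same hypothesis can yield very different WER scores depending on text normalization. Below is a small sketch using jiwer and the BasicTextNormalizer shipped with openai-whisper; the Basque example strings are invented.

```python
# Hedged sketch: how evaluation parameters change the reported WER.
# Requires: pip install jiwer openai-whisper. Example strings invented.
import jiwer
from whisper.normalizers import BasicTextNormalizer

reference = "Kaixo, zer moduz zaude gaur?"
hypothesis = "kaixo zer moduz zaude gaur"

raw_wer = jiwer.wer(reference, hypothesis)  # case/punctuation count as errors

normalize = BasicTextNormalizer()  # lowercases and strips punctuation
norm_wer = jiwer.wer(normalize(reference), normalize(hypothesis))

print(f"raw WER:        {raw_wer:.0%}")   # 40% here
print(f"normalized WER: {norm_wer:.0%}")  # 0% here
```

Whichever normalizer is chosen, reporting it alongside the WER keeps comparisons across models and papers meaningful.
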
Xabier de Zuazo
HiTZ - University of the Basque Country - UPV/EHU, Spain
Eva Navas
University of the Basque Country
Speech synthesis · speaker diarization
Ibon Saratxaga
University of the Basque Country (UPV/EHU)
Speech · sound classification
Inma Hernáez Rioja
HiTZ - University of the Basque Country - UPV/EHU, Spain