🤖 AI Summary
To address the limited cross-lingual transferability and domain mismatch of self-supervised learning (SSL) pre-trained models in automatic speech recognition (ASR) for low-resource languages, this paper proposes a lightweight adapter method with *intermediate warm-start*. The SSL backbone stays frozen, and only 1–5% of the total parameters are fine-tuned. A two-stage progressive adaptation jointly warms up the adapter and the downstream model initialization; this intermediate warm-start mitigates the shift in speech feature distributions and substantially improves generalization to unseen languages. Evaluated on the ML-SUPERB benchmark, the approach achieves up to a 28% relative reduction in character/phoneme error rate over conventional efficient fine-tuning, easing a key bottleneck in low-resource cross-lingual ASR adaptation.
📝 Abstract
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter a domain mismatch between the pre-training languages and the target low-resource languages. Typical solutions are unsatisfying: fine-tuning the full SSL model incurs high computation costs, while using a frozen SSL model as a feature extractor yields poor performance. To handle these issues, we extend a conventional adapter-based efficient fine-tuning scheme with an extra intermediate adaptation stage that warms up the adapter and the downstream model initialization. Remarkably, we update only 1–5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning, achieving up to a 28% relative improvement in Character/Phoneme Error Rate when adapting to unseen languages.
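To make the "adapter on a frozen backbone" idea concrete, here is a minimal sketch of a standard bottleneck adapter and its parameter budget. This is not the paper's implementation: the hidden size, bottleneck width, layer count, and backbone size below are illustrative assumptions (roughly a HuBERT-Base-scale model), chosen only to show why the trainable fraction lands in the 1–5% range.

```python
import numpy as np

HIDDEN = 768      # assumed SSL hidden size (typical for wav2vec2/HuBERT-Base)
BOTTLENECK = 32   # assumed small adapter bottleneck width

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class Adapter:
    """Down-project -> nonlinearity -> up-project, with a residual connection.
    Only these weights would be trained; the SSL backbone stays frozen."""
    def __init__(self):
        self.w_down = rng.normal(0.0, 0.02, (HIDDEN, BOTTLENECK))
        self.w_up = np.zeros((BOTTLENECK, HIDDEN))  # zero init: starts as identity

    def __call__(self, h):
        # Residual keeps the frozen backbone's features intact at initialization.
        return h + relu(h @ self.w_down) @ self.w_up

    def num_params(self):
        return self.w_down.size + self.w_up.size

# Rough parameter budget: one adapter per layer vs. the frozen backbone.
layers = 12
backbone_params = 95_000_000  # ~95M, assumed backbone scale
adapter_params = layers * Adapter().num_params()
trainable_frac = adapter_params / backbone_params
print(f"trainable fraction: {trainable_frac:.2%}")  # well under 5%
```

The zero-initialized up-projection makes each adapter an identity map at the start of training, so adaptation begins from the frozen backbone's original features rather than perturbing them; the warm-start stage in the paper plays a complementary role by giving the adapter and downstream model a better starting point before the final fine-tuning.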