MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the multilingual speech processing needs of Singapore and Southeast Asia—particularly the challenge of recognizing localized varieties such as Singaporean English (Singlish)—this work introduces a region-specific speech foundation model. The model is trained from scratch via masked self-supervised learning on 200,000 hours of unlabelled speech data. Key contributions include the data engineering pipeline and hyperparameter optimization strategy designed for the Southeast Asian linguistic ecosystem. Experiments demonstrate improvements over prior methods on spontaneous speech and Singlish ASR benchmarks; moreover, the model achieves performance on par with state-of-the-art speech encoders across a further ten diverse downstream speech tasks. To foster regional advancement in speech AI, the model will be publicly released.

📝 Abstract
This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs of Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON-SpeechEncoder was pre-trained from scratch on 200,000 hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements on spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive with other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.
Problem

Research questions and friction points this paper is trying to address.

Develops a speech foundation model for Singapore and Southeast Asia
Addresses speech processing needs with Singapore English support
Improves speech recognition for spontaneous and Singaporean benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pre-training with a masked language modelling objective
Pre-training from scratch on 200,000 hours of unlabelled speech
Tailored to Singapore English speech recognition
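The masked self-supervised objective described above can be illustrated with a toy sketch. The report does not include code, so the snippet below is a hypothetical, BEST-RQ-style illustration of the general idea (not the authors' implementation): frames of speech features are quantized into discrete targets via a frozen random projection and codebook, random spans of the input are masked, and an encoder would be trained with cross-entropy to predict the targets only at the masked positions. All names, dimensions, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection_targets(features, projection, codebook):
    """Assign each frame a discrete target id: project the features, then
    pick the nearest codebook vector (random-quantizer style)."""
    projected = features @ projection                            # (T, d_code)
    dists = np.linalg.norm(
        projected[:, None, :] - codebook[None, :, :], axis=-1)   # (T, n_codes)
    return dists.argmin(axis=1)                                  # (T,)

def mask_spans(num_frames, span_len=4, start_prob=0.15, rng=rng):
    """Sample random span starts and return a boolean frame mask."""
    mask = np.zeros(num_frames, dtype=bool)
    for start in np.flatnonzero(rng.random(num_frames) < start_prob):
        mask[start:start + span_len] = True
    return mask

# toy example: 50 frames of 80-dim log-mel features (illustrative sizes)
T, d_feat, d_code, n_codes = 50, 80, 16, 32
features = rng.standard_normal((T, d_feat))
projection = rng.standard_normal((d_feat, d_code))   # frozen, random
codebook = rng.standard_normal((n_codes, d_code))    # frozen, random

targets = random_projection_targets(features, projection, codebook)
mask = mask_spans(T)

# The encoder would receive `masked_inputs` (masked frames replaced, e.g. by a
# learned embedding; zeroed here for simplicity) and be trained with
# cross-entropy to predict targets[mask] at the masked positions only.
masked_inputs = features.copy()
masked_inputs[mask] = 0.0
```

Only the masked positions contribute to the loss, which forces the encoder to infer speech content from surrounding context rather than copy its input.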
Muhammad Huzaifah
Institute for Infocomm Research (I2R), A*STAR, Singapore
Tianchi Liu
Tencent, Singapore; Ph.D. @ National University of Singapore; Ex-A*STAR, Singapore
Text-to-Speech, Speech-LLM, Speaker Verification, Anti-spoofing, Deepfake Detection
Hardik B. Sailor
Institute for Infocomm Research (I2R), A*STAR, Singapore
Kye Min Tan
Institute for Infocomm Research (I2R), A*STAR, Singapore
Tarun K. Vangani
Institute for Infocomm Research (I2R), A*STAR, Singapore
Qiongqiong Wang
Lead Research Engineer, Institute for Infocomm Research, A*STAR, Singapore
Deep Learning, Artificial Intelligence, Machine Learning
Jeremy H. M. Wong
Institute for Infocomm Research (I2R), A*STAR, Singapore
Nancy F. Chen
ISCA Fellow, AAIA Fellow, Multimodal Generative AI Group Leader, AI for Education Head at A*STAR
Agentic AI, Large Language Models, Conversational AI
Ai Ti Aw
Institute for Infocomm Research (I2R), A*STAR, Singapore