MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the multilingual speech processing needs of Singapore and Southeast Asia—particularly the challenge of recognizing localized varieties such as Singaporean English (Singlish)—this work introduces a region-specific speech foundation model. The model is trained from scratch via masked self-supervised learning on 200,000 hours of unlabelled speech data. Key contributions include the data engineering pipeline and hyperparameter optimization strategy designed for the Southeast Asian linguistic ecosystem. Experiments demonstrate improvements over prior methods on spontaneous speech and Singlish ASR benchmarks; moreover, the model achieves performance on par with state-of-the-art speech encoders across a further ten diverse downstream speech tasks. To foster regional advancement in speech AI, the model will be publicly released.

📝 Abstract
This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs of Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON-SpeechEncoder was pre-trained from scratch on 200,000 hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements on spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive with other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.
Problem

Research questions and friction points this paper is trying to address.

Develops a speech foundation model for Singapore and Southeast Asia
Addresses speech processing needs with Singapore English support
Improves speech recognition for spontaneous and Singaporean benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised pre-training with a masked language modelling objective
Pre-training from scratch on 200,000 hours of unlabelled speech
Tailored to Singapore English speech recognition
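The masked self-supervised objective described above can be illustrated with a toy sketch. The report does not include code, so the snippet below is a hypothetical, BEST-RQ-style illustration of the general idea (not the authors' implementation): frames of speech features are quantized into discrete targets via a frozen random projection and codebook, random spans of the input are masked, and an encoder would be trained with cross-entropy to predict the targets only at the masked positions. All names, dimensions, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection_targets(features, projection, codebook):
    """Assign each frame a discrete target id: project the features, then
    pick the nearest codebook vector (random-quantizer style)."""
    projected = features @ projection                            # (T, d_code)
    dists = np.linalg.norm(
        projected[:, None, :] - codebook[None, :, :], axis=-1)   # (T, n_codes)
    return dists.argmin(axis=1)                                  # (T,)

def mask_spans(num_frames, span_len=4, start_prob=0.15, rng=rng):
    """Sample random span starts and return a boolean frame mask."""
    mask = np.zeros(num_frames, dtype=bool)
    for start in np.flatnonzero(rng.random(num_frames) < start_prob):
        mask[start:start + span_len] = True
    return mask

# toy example: 50 frames of 80-dim log-mel features (illustrative sizes)
T, d_feat, d_code, n_codes = 50, 80, 16, 32
features = rng.standard_normal((T, d_feat))
projection = rng.standard_normal((d_feat, d_code))   # frozen, random
codebook = rng.standard_normal((n_codes, d_code))    # frozen, random

targets = random_projection_targets(features, projection, codebook)
mask = mask_spans(T)

# The encoder would receive `masked_inputs` (masked frames replaced, e.g. by a
# learned embedding; zeroed here for simplicity) and be trained with
# cross-entropy to predict targets[mask] at the masked positions only.
masked_inputs = features.copy()
masked_inputs[mask] = 0.0
```

Only the masked positions contribute to the loss, which forces the encoder to infer speech content from surrounding context rather than copy its input.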
Muhammad Huzaifah
Institute for Infocomm Research (I2R), A*STAR, Singapore
Tianchi Liu
Tencent, Singapore; Ph.D. @ National University of Singapore; Ex-A*STAR, Singapore
Text-to-Speech, Speech-LLM, Speaker Verification, Anti-spoofing, Deepfake Detection
Hardik B. Sailor
Institute for Infocomm Research (I2R), A*STAR, Singapore
Kye Min Tan
Institute for Infocomm Research (I2R), A*STAR, Singapore
Tarun K. Vangani
Institute for Infocomm Research (I2R), A*STAR, Singapore
Qiongqiong Wang
Lead Research Engineer, Institute for Infocomm Research, A*STAR, Singapore
Deep Learning, Artificial Intelligence, Machine Learning
Jeremy H. M. Wong
Institute for Infocomm Research (I2R), A*STAR, Singapore
Nancy F. Chen
ISCA Fellow, AAIA Fellow, Multimodal Generative AI Group Leader, AI for Education Head at A*STAR
Agentic AI, Large Language Models, Conversational AI
Ai Ti Aw
Institute for Infocomm Research (I2R), A*STAR, Singapore