UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

📅 2026-06-10
🤖 AI Summary
This work addresses the limited scalability of conventional grapheme-to-phoneme (G2P)-based text-to-speech (TTS) systems, which depend on language-specific G2P resources and are therefore limited to around 100 languages. The authors propose UR-BERT, a BERT-based text encoder that maps diverse writing systems into a shared romanized representation and is trained with a speech token prediction objective, which encourages the encoder to learn speech-aware phonetic representations. Because it does not rely on G2P resources, the encoder scales to 495 languages. In the reported experiments, TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and generalize well to unseen languages.
📝 Abstract
We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages by the limited availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.
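To make the idea of a shared Romanization representation concrete, the following is a minimal sketch, not the paper's romanizer. The function `toy_romanize` and the table `TOY_TABLE` are hypothetical names introduced here, and the table covers only a handful of Cyrillic and Greek letters chosen for illustration. It shows how text in different scripts can collapse into one small Latin character inventory that a single encoder vocabulary could cover.

```python
import unicodedata

# Hypothetical, hand-written table for a few Cyrillic and Greek letters.
# A real universal romanizer covers far more scripts; this is illustration only.
TOY_TABLE = {
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",
    "γ": "g", "ε": "e", "ι": "i", "α": "a",
}


def toy_romanize(text: str) -> str:
    """Map text to a lowercase Latin string using the toy table above."""
    out = []
    # NFKD splits accented letters into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFKD", text.lower()):
        if unicodedata.combining(ch):
            continue  # drop diacritics such as the acute accent in "é"
        out.append(TOY_TABLE.get(ch, ch))  # unmapped characters pass through
    return "".join(out)


def shared_inventory(texts):
    """Collect the set of characters used after romanization."""
    return sorted({ch for t in texts for ch in toy_romanize(t)})


if __name__ == "__main__":
    samples = ["Привет", "γεια", "Café"]
    for s in samples:
        print(s, "->", toy_romanize(s))
    print(shared_inventory(samples))
```

Under these assumptions, inputs written in three different scripts end up in one small Latin alphabet, which is the property that removes the need for per-language G2P resources. The speech token prediction objective described in the abstract is not sketched here, because the abstract does not specify how text positions are paired with speech tokens.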
Problem

Research questions and friction points this paper is trying to address.

multilingual TTS
grapheme-to-phoneme
language scalability
text-to-speech
low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Romanization
Speech Token Prediction
Massively Multilingual TTS
Text Encoder
Phonetic Representation