UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

📅 2026-06-10

🤖 AI Summary

This work addresses the limited scalability of conventional grapheme-to-phoneme (G2P)-based text-to-speech (TTS) systems, which struggle to support hundreds of languages because they depend on language-specific G2P resources. The authors propose UR-BERT, a BERT-based TTS text encoder that maps diverse writing systems into a shared romanized representation and is trained with a speech token prediction objective, so that the encoder learns speech-aware phonetic representations. Because it does not rely on G2P, the encoder scales to 495 languages. In the reported experiments, TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and generalize well to unseen languages.
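
To make the idea of a shared romanized representation concrete, here is a minimal sketch. It is not the romanizer used by the authors, which this page does not specify. The table `TOY_TABLE` and the function `toy_romanize` are hypothetical names, the table covers only a handful of characters, and lowercasing plus diacritic stripping are assumptions made for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping. A real universal romanizer
# covers many scripts and handles context-dependent rules.
TOY_TABLE = {
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",  # Cyrillic
    "γ": "g", "ε": "e", "ι": "i", "α": "a",                      # Greek
}


def toy_romanize(text: str) -> str:
    """Map text from several scripts onto one lowercase Latin alphabet."""
    out = []
    # NFKD splits accented letters into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFKD", text.lower()):
        if unicodedata.combining(ch):
            continue  # assumption: diacritics are dropped
        out.append(TOY_TABLE.get(ch, ch))  # unknown characters pass through
    return "".join(out)


print(toy_romanize("привет"))  # Cyrillic input
print(toy_romanize("γεια"))    # Greek input
print(toy_romanize("Café"))    # Latin input with a diacritic
```

After this step, text from all three scripts shares a single Latin character inventory, which is what lets one encoder vocabulary serve many languages without per-language G2P resources.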

📝 Abstract

We propose UR-BERT, a text encoder based on Romanized transcriptions for massively multilingual text-to-speech (TTS) systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages by the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.
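
The speech token prediction objective can be pictured with a small sketch. The abstract does not say how speech tokens are obtained, how they are aligned to text positions, or which loss is used, so everything below is an assumption: one discrete speech-token target per encoder position, a fixed codebook, and a mean cross-entropy loss. The function name `speech_token_loss` is hypothetical.

```python
import math


def speech_token_loss(logits, targets):
    """Mean cross-entropy between per-position logits and speech-token ids.

    logits:  list of rows, one per text position, each of codebook size
    targets: list of integer speech-token ids, one per text position
    (assumption: text positions and speech tokens are aligned one-to-one)
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]  # negative log-probability of the target
    return total / len(targets)


# Two positions, codebook of 4 hypothetical speech tokens.
uninformed = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident = [[0.0, 0.0, 5.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
targets = [2, 0]

print(speech_token_loss(uninformed, targets))
print(speech_token_loss(confident, targets))
```

With uniform logits the loss equals the log of the codebook size, and it falls as the encoder's predictions concentrate on the correct speech tokens, which is the pressure that would push text representations toward speech-aware ones.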

Problem

Research questions and friction points this paper is trying to address.

multilingual TTS

grapheme-to-phoneme

language scalability

text-to-speech

low-resource languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Romanization

Speech Token Prediction

Massively Multilingual TTS