🤖 AI Summary
This work addresses the limited scalability of conventional grapheme-to-phoneme (G2P)-based text-to-speech (TTS) systems, which depend on language-specific G2P resources and therefore cover only around 100 languages. The authors propose UR-BERT, a BERT-based TTS text encoder that converts diverse writing systems into a shared romanized representation and scales to 495 languages. A speech token prediction objective applied during training encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. By avoiding G2P dependencies, TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and generalize well to unseen languages.
📝 Abstract
We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages because reliable G2P resources are scarce. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.
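To make the idea of a shared Romanization representation concrete, the following is a toy sketch and not the authors' romanizer. The mapping table `TOY_TABLE`, the functions `toy_romanize` and `to_ids`, and the vocabulary `SHARED_VOCAB` are all hypothetical names introduced here for illustration. The sketch only shows how text in different scripts can end up in one small Latin character vocabulary that a single encoder could consume.

```python
import unicodedata

# Hypothetical, hand-written mapping for illustration only. A real
# romanizer covers far more characters and context-dependent rules.
TOY_TABLE = {
    "п": "p", "р": "r", "и": "i", "в": "v", "е": "e", "т": "t",  # Cyrillic
    "γ": "g", "ε": "e", "ι": "i", "α": "a",                      # Greek
}

# One shared character vocabulary for every language (assumed here to be
# lowercase ASCII letters plus space).
SHARED_VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}


def toy_romanize(text: str) -> str:
    """Map text to lowercase Latin letters using TOY_TABLE."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters missing from the toy table pass through unchanged.
    return "".join(TOY_TABLE.get(ch, ch) for ch in stripped)


def to_ids(romanized: str, vocab: dict) -> list:
    """Encode romanized text as indices into the shared vocabulary."""
    return [vocab[ch] for ch in romanized]


if __name__ == "__main__":
    for word in ["привет", "γειά", "café"]:
        roman = toy_romanize(word)
        print(word, "->", roman, to_ids(roman, SHARED_VOCAB))
```

Because every script is reduced to the same symbol set, no per-language G2P lexicon is needed at the input stage. How pronunciation information is recovered from such a lossy representation is exactly what the speech token prediction objective described above is meant to address.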