🤖 AI Summary
To address the high cost and poor scalability of grapheme-to-phoneme (G2P) conversion in multilingual mixed-script speech synthesis (e.g., Japanese text mixing kanji and kana), this paper proposes a G2P-free end-to-end TTS framework. Instead of phoneme-based modeling, the method leverages self-supervised speech models (e.g., wav2vec 2.0) to extract discrete speech units and trains a T5 encoder to map raw mixed-script input directly to speech-token sequences. A pseudo-language labeling mechanism replaces manual phonemic annotation, substantially reducing reliance on high-quality phonetic transcriptions. Experiments show that the proposed approach matches G2P-based baselines in naturalness, prosody, and accent modeling, while preserving both linguistic and paralinguistic characteristics. This work establishes an efficient, scalable paradigm for modeling large-scale, untranscribed multilingual speech data without language-specific phonological resources.
📝 Abstract
This study presents a novel approach to speech synthesis that replaces traditional grapheme-to-phoneme (G2P) conversion with a deep learning-based model that generates discrete tokens directly from speech. Using a pre-trained self-supervised speech (SSL) model, we train a T5 encoder to produce pseudo-language labels from mixed-script texts (e.g., texts containing kanji and kana). This method eliminates the need for manual phonetic transcription, reducing costs and improving scalability, especially for large untranscribed audio datasets. Our model matches the performance of conventional G2P-based text-to-speech systems and can synthesize speech that retains natural linguistic and paralinguistic features, such as accent and intonation.
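To make the "discrete speech units" idea concrete: in unit-based pipelines of this kind, SSL frame embeddings are typically quantized against a learned codebook (e.g., k-means centroids over wav2vec 2.0 or HuBERT features) and consecutive repeats are collapsed into a compact token sequence. The sketch below is a toy illustration of that quantize-and-dedup step only, not the paper's actual pipeline; the codebook, feature vectors, and function names are invented stand-ins.

```python
def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector
    (squared Euclidean distance), yielding one discrete unit per frame."""
    ids = []
    for f in frames:
        best = min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(f, codebook[i])))
        ids.append(best)
    return ids

def dedupe(ids):
    """Collapse consecutive repeated unit IDs, as is common before
    feeding units to a downstream sequence model."""
    out = []
    for i in ids:
        if not out or out[-1] != i:
            out.append(i)
    return out

# Toy 2-D "SSL features" and a 3-entry codebook (illustrative only;
# real SSL embeddings are hundreds of dimensions with large codebooks).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.1], [1.1, 0.9], [2.1, 0.1]]

units = dedupe(quantize(frames, codebook))  # -> [0, 1, 2]
```

A TTS decoder trained on such unit sequences can then be driven by a text encoder (here, T5) predicting units directly from raw mixed-script text, bypassing phonemes entirely.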