MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high cost and poor scalability of grapheme-to-phoneme (G2P) conversion in mixed-script speech synthesis (e.g., Japanese text mixing Kanji and Kana), this paper proposes a G2P-free end-to-end TTS framework. Instead of phoneme-based modeling, the method leverages self-supervised speech models (e.g., wav2vec 2.0) to extract discrete speech units and trains a T5 encoder to map raw mixed-script input directly to speech token sequences. A pseudo-language labeling mechanism replaces manual phonemic annotation, substantially reducing reliance on high-quality phonetic transcriptions. Experiments demonstrate that the proposed approach matches G2P-based baselines in naturalness, prosody, and accent modeling, while preserving both linguistic and paralinguistic characteristics. This work establishes an efficient, scalable paradigm for modeling large-scale, unlabeled multilingual speech data without language-specific phonological resources.
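The core idea above — quantizing frame-level SSL features into discrete units that serve as pseudo-labels in place of phonemes — can be sketched with a toy k-means quantizer. This is a minimal illustration, not the paper's implementation: the features here are dummy vectors standing in for wav2vec 2.0 hidden states, and the cluster count, distance, and deduplication choices are assumptions.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_quantize(features, k=2, iters=10, seed=0):
    """Cluster frame-level features into k discrete speech units.

    A stand-in for quantizing SSL (e.g., wav2vec 2.0) frame embeddings;
    the returned cluster IDs act as pseudo-language labels."""
    rng = random.Random(seed)
    centroids = rng.sample(features, k)
    assign = []
    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(f, centroids[c]))
                  for f in features]
        # Recompute each centroid as the mean of its assigned frames.
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return assign

def deduplicate(units):
    # Collapse consecutive repeats, a common step when SSL units
    # are used as discrete targets.
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
units = kmeans_quantize(frames, k=2)
print(deduplicate(units))
```

In the actual framework, the T5 encoder would then be trained to predict these deduplicated unit sequences directly from raw text, bypassing any phonetic dictionary.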

📝 Abstract
This study presents a novel approach to voice synthesis that can substitute the traditional grapheme-to-phoneme (G2P) conversion by using a deep learning-based model that generates discrete tokens directly from speech. Utilizing a pre-trained voice SSL model, we train a T5 encoder to produce pseudo-language labels from mixed-script texts (e.g., containing Kanji and Kana). This method eliminates the need for manual phonetic transcription, reducing costs and enhancing scalability, especially for large non-transcribed audio datasets. Our model matches the performance of conventional G2P-based text-to-speech systems and is capable of synthesizing speech that retains natural linguistic and paralinguistic features, such as accents and intonations.
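The abstract's key point — that mixed Kanji/Kana text can be fed to the model as-is, with no phonetic dictionary — amounts to encoding raw characters straight to vocabulary IDs. The following sketch shows that mapping; the vocabulary and corpus are hypothetical, not from the paper.

```python
def build_vocab(texts):
    # Build a character-level vocabulary from raw mixed-script text.
    # No G2P step: Kanji and Kana characters map directly to IDs.
    chars = sorted({ch for t in texts for ch in t})
    return {ch: i + 1 for i, ch in enumerate(chars)}  # 0 reserved for unknown

def encode(text, vocab):
    # Unseen characters fall back to ID 0.
    return [vocab.get(ch, 0) for ch in text]

corpus = ["東京です", "とうきょうです"]  # same word written in Kanji vs. Kana
vocab = build_vocab(corpus)
print(encode("東京です", vocab))
```

The model (a T5 encoder, in the paper's setup) is then left to learn the text-to-speech-unit mapping from such raw sequences, which is what removes the manual transcription cost.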
Problem

Research questions and friction points this paper is trying to address.

Eliminates manual phonetic transcription for speech synthesis
Handles mixed-script texts without grapheme-to-phoneme conversion
Generates speech with natural accents and intonations directly
Innovation

Methods, ideas, or system contributions that make the work stand out.

G2P-free synthesis using deep learning
SSL model generates pseudo-language labels
Eliminates manual phonetic transcription costs
Joonyong Park
The University of Tokyo, Japan
Daisuke Saito
The University of Tokyo, Japan
Nobuaki Minematsu
The University of Tokyo, Japan
Speech Communication · Foreign Language Learning