🤖 AI Summary
Addressing the challenge of multilingual text-to-speech (TTS) synthesis across the 1,369 languages and 13 scripts of the Indian subcontinent, this work introduces the first end-to-end TTS framework tailored for everyday code-switching. Methodologically, the authors propose a phonetics-based Common Label Set (CLS) that gives a script- and language-agnostic input representation: text in any supported Indian language is normalized and mapped to the shared labels, which removes the need for large, language-specific phoneme or word inventories. A single acoustic model trained on this unified representation supports seamless, speaker-consistent synthesis across 13 Indian languages and English in one voice. Experiments demonstrate state-of-the-art performance in naturalness (MOS), code-switching accuracy, and cross-lingual generalization, while significantly improving prosodic continuity and speaker identity preservation during intra-sentence language switching.
📝 Abstract
India has 1369 languages, of which 22 are official. About 13 different scripts are used to represent these languages. A Common Label Set (CLS), based on phonetics, was developed to address the large vocabulary of units that the End-to-End (E2E) framework would otherwise require for multilingual synthesis. Indian-language text is first converted to CLS. This approach enables seamless code-switching across 13 Indian languages and English in a given native speaker's voice, which corresponds to everyday speech in the Indian subcontinent, where the population is multilingual.
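The conversion of Indian-language text to a common label set can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the label names (`k`, `m`, `l`, `a`, `aa`, `i`), the six-consonant table, and the function `to_common_labels` are hypothetical and are not the actual CLS inventory or the authors' converter. The sketch only shows the idea that characters from different scripts (here Devanagari and Tamil) that stand for the same sound map to one shared label, so the downstream model sees a single small vocabulary.

```python
# Hypothetical common-label tables. The labels are illustrative only and
# are NOT the actual CLS inventory described in the paper.
CONSONANTS = {
    "\u0915": "k", "\u092e": "m", "\u0932": "l",  # Devanagari KA, MA, LA
    "\u0b95": "k", "\u0bae": "m", "\u0bb2": "l",  # Tamil KA, MA, LA
}
VOWEL_SIGNS = {
    "\u093e": "aa", "\u093f": "i",  # Devanagari vowel signs AA, I
    "\u0bbe": "aa", "\u0bbf": "i",  # Tamil vowel signs AA, I
}
VIRAMAS = {"\u094d", "\u0bcd"}  # Devanagari virama, Tamil pulli
INHERENT_VOWEL = "a"  # vowel carried by a bare consonant in both scripts


def to_common_labels(text):
    """Map script-specific characters to shared labels (toy coverage only)."""
    labels = []
    pending_inherent = False  # True while the last consonant still owes its vowel
    for ch in text:
        if ch in CONSONANTS:
            if pending_inherent:
                labels.append(INHERENT_VOWEL)
            labels.append(CONSONANTS[ch])
            pending_inherent = True
        elif ch in VOWEL_SIGNS:
            # An explicit vowel sign replaces the inherent vowel.
            labels.append(VOWEL_SIGNS[ch])
            pending_inherent = False
        elif ch in VIRAMAS:
            # A virama suppresses the inherent vowel.
            pending_inherent = False
        else:
            raise ValueError(f"character {ch!r} is outside this toy table")
    if pending_inherent:
        labels.append(INHERENT_VOWEL)
    return labels


devanagari_word = "\u092e\u093e\u0932\u093e"  # MA + AA, LA + AA
tamil_word = "\u0bae\u0bbe\u0bb2\u0bbe"       # MA + AA, LA + AA
print(to_common_labels(devanagari_word))
print(to_common_labels(tamil_word))
```

Both words produce the same label sequence even though they are written in different scripts, which is the property that lets one model serve several languages. A real converter would also need language-specific rules (for example, schwa deletion in Hindi) that this sketch ignores.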