Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages

📅 2025-06-04
🤖 AI Summary
India has 1,369 languages, including 22 official languages written in 13 scripts, yet most lack the digital speech resources that conventional TTS training requires. To address this extreme low-resource setting, particularly for languages whose scripts and phonotactics come from different families (e.g., Sanskrit, Konkani, Maithili, Kurukh), this paper proposes a zero-shot cross-lingual text-to-speech framework. The method comprises three key components: (1) a multilingual shared phoneme representation space; (2) a phonology-aware text parsing mechanism that adapts the orthography-to-phonology mapping to each target language; and (3) zero-shot acoustic modeling integrated with a neural vocoder. Experiments demonstrate intelligible and natural synthesized speech even when no target-language speech data is available. Both objective metrics and subjective listening tests confirm the framework's effectiveness, indicating a viable path to TTS deployment across highly under-resourced, typologically diverse languages.

📝 Abstract
Text-to-speech (TTS) systems typically require high-quality studio data and accurate transcriptions for training. India has 1,369 languages, of which the 22 official languages use 13 scripts. Training a TTS system for all these languages, most of which have no digital resources, seems a Herculean task. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. The novelty of our work lies in augmenting a shared phone representation and modifying the text parsing rules to match the phonotactics of the target language, thus reducing the synthesiser overhead and enabling rapid adaptation. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili and Kurukh by leveraging linguistic connections across languages with suitable synthesisers. Evaluations confirm the effectiveness of this approach, highlighting its potential to expand speech technology access for under-represented languages.
Problem

Research questions and friction points this paper is trying to address.

Zero-shot TTS for Indian languages lacking digital resources
Adapting shared phone representation for diverse scripts and phonotactics
Enabling intelligible speech synthesis for underrepresented language families
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmenting shared phone representation for zero-shot TTS
Modifying text parsing rules for target phonotactics
Leveraging linguistic connections across Indian languages
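
The first two ideas in this list, a shared phone representation and language-specific parsing rules, can be illustrated with a toy Python sketch. Nothing in it comes from the paper: the character tables, the phone labels, the function name `parse_word`, and the `delete_final_schwa` flag are all assumptions made for illustration. It maps a handful of Devanagari and Kannada characters onto one phone inventory and toggles a single phonotactic rule, word-final schwa deletion, which Hindi applies and Sanskrit does not.

```python
# Toy shared phone table (hypothetical): consonants from two scripts
# map to the same phone labels. Each consonant carries an inherent "a".
CONSONANTS = {
    "\u0915": "k", "\u092e": "m", "\u0930": "r", "\u0932": "l",  # Devanagari ka, ma, ra, la
    "\u0c95": "k", "\u0cae": "m", "\u0cb0": "r", "\u0cb2": "l",  # Kannada ka, ma, ra, la
}
# Dependent vowel signs replace the inherent vowel of the preceding consonant.
VOWEL_SIGNS = {
    "\u093e": "aa", "\u093f": "i",  # Devanagari signs aa, i
    "\u0cbe": "aa", "\u0cbf": "i",  # Kannada signs aa, i
}
# The virama suppresses the inherent vowel.
VIRAMAS = {"\u094d", "\u0ccd"}  # Devanagari, Kannada


def parse_word(word, delete_final_schwa):
    """Convert one word to shared phone labels.

    delete_final_schwa is a hypothetical per-language switch standing in
    for a target-specific parsing rule.
    """
    phones = []
    for ch in word:
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            phones.append("a")  # inherent vowel, may be overridden below
        elif ch in VOWEL_SIGNS:
            if phones and phones[-1] == "a":
                phones[-1] = VOWEL_SIGNS[ch]
        elif ch in VIRAMAS:
            if phones and phones[-1] == "a":
                phones.pop()
        else:
            raise ValueError(f"character not in toy table: {ch!r}")
    # Drop a word-final inherent vowel, but never in a one-consonant word.
    if delete_final_schwa and len(phones) > 2 and phones[-1] == "a":
        phones.pop()
    return phones


if __name__ == "__main__":
    kamala_dev = "\u0915\u092e\u0932"  # "kamala" in Devanagari
    kamala_kan = "\u0c95\u0cae\u0cb2"  # "kamala" in Kannada script
    print(parse_word(kamala_dev, delete_final_schwa=False))  # ['k', 'a', 'm', 'a', 'l', 'a']
    print(parse_word(kamala_dev, delete_final_schwa=True))   # ['k', 'a', 'm', 'a', 'l']
    print(parse_word(kamala_kan, delete_final_schwa=False))  # ['k', 'a', 'm', 'a', 'l', 'a']
```

Because both scripts resolve to the same labels, a synthesiser trained on one language could in principle consume text from a related one, and only the rule switch changes per target. A real system would need full character coverage and many more rules than this sketch shows.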
Utkarsh Pathak
Research Scholar at Speech and Music Lab, IIT Madras
Text to speech, Zero-shot Speech, Speech Enhancement, IndicTTS
Chandra Sai Krishna Gunda
Indian Institute of Technology, Madras, India
Anusha Prakash
Indian Institute of Technology Madras
Speech synthesis, dysarthric speech
Keshav Agarwal
Indian Institute of Technology, Madras, India
H. Murthy
Indian Institute of Technology, Madras, India