Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages

📅 2024-11-07

📈 Citations: 2

✨ Influential: 1

🤖 AI Summary

Speech translation for Indian languages has long been limited by the scarcity of large-scale, linguistically diverse, multi-domain public datasets. To address this, we introduce BhasaAnuvaad, the largest speech translation dataset for Indian languages to date, comprising over 44K hours of audio and 17M aligned text segments across 14 Indian languages and English. The dataset is built with three complementary strategies: (i) aggregating high-quality existing sources, (ii) large-scale web crawling for linguistic and domain diversity, and (iii) generating synthetic data that models real-world speech disfluencies. Using BhasaAnuvaad, we train IndicSeamless, a speech translation model for Indian languages, and we plan to release the dataset, model weights, and code under permissive licenses. Evaluated on 14 languages, our approach improves on prior state-of-the-art, with BLEU gains of +3.2 to +5.8. The release is intended to support equitable and reproducible low-resource speech translation research.

📝 Abstract

Speech translation for Indian languages remains a challenging task due to the scarcity of large-scale, publicly available datasets that capture the linguistic diversity and domain coverage essential for real-world applications. Existing datasets cover only a fraction of Indian languages and lack the breadth needed to train robust models that generalize beyond curated benchmarks. To bridge this gap, we introduce BhasaAnuvaad, the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments across 14 Indian languages and English. Our dataset is built through a threefold methodology: (a) aggregating high-quality existing sources, (b) large-scale web crawling to ensure linguistic and domain diversity, and (c) creating synthetic data to model real-world speech disfluencies. Leveraging BhasaAnuvaad, we train IndicSeamless, a state-of-the-art speech translation model for Indian languages that outperforms existing models. Our experiments demonstrate improvements in translation quality, setting a new standard for Indian language speech translation. We will release all code, data, and model weights as open source under permissive licenses to promote accessibility and collaboration.

Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale speech translation datasets for Indian languages

Existing datasets insufficient for robust, generalizable model training

Need for diverse, high-quality data to improve translation accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest speech translation dataset for Indian languages

Threefold methodology for dataset creation

State-of-the-art speech translation model IndicSeamless
