BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

📅 2026-05-26

🤖 AI Summary
This study addresses the scarcity of high-quality English–Marathi parallel corpora in low-resource neural machine translation by constructing a large-scale dataset of 2.78 million sentence pairs spanning news, politics, healthcare, literature, and culture. The pipeline integrates multi-source heterogeneous data collection, stemming and tokenization, corpus-level deduplication, and parameter-efficient fine-tuning of the NLLB-200-distilled-600M model using LoRA. The ablation study shows that corpus deduplication is critical for morphologically rich languages: omitting this step lowers BLEU by 1.17 and chrF++ by 2.21. This work presents a linguistically enriched English–Marathi parallel corpus and underscores the substantial benefits of rigorous data curation in low-resource NMT settings.
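To make "parameter-efficient" concrete, here is a minimal sketch of the standard LoRA parameter count: a frozen weight matrix of shape d_out × d_in gets a trainable low-rank update built from two factors, which adds rank × (d_in + d_out) trainable parameters. The matrix size (1024 × 1024) and rank (8) below are illustrative assumptions, not settings reported by the paper, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one matrix.

    The update is B @ A with B of shape (d_out, rank) and
    A of shape (rank, d_in); the original matrix stays frozen.
    """
    return rank * (d_in + d_out)


# Illustrative values only (assumed, not taken from the paper).
d_out, d_in, rank = 1024, 1024, 8

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full matrix: {full} parameters")
print(f"LoRA update: {lora} trainable parameters")
print(f"fraction trained: {lora / full:.4f}")
```

Under these assumed values, the adapter trains well under 2% of the parameters of the matrix it adapts, which is why this style of fine-tuning is practical on modest hardware.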
📝 Abstract
We present BhashaSetu, a linguistically enriched English–Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.
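As an illustration of what corpus-level deduplication across sources can look like, the sketch below removes repeated sentence pairs using a normalized hash key. It is a minimal example under stated assumptions, not the authors' pipeline: the normalization steps (Unicode NFC, whitespace collapsing, case folding), the exact-match criterion, and all function names are hypothetical choices made for this example.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize, collapse runs of whitespace, and case-fold."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()


def pair_key(src: str, tgt: str) -> str:
    """Stable hash of a normalized (source, target) sentence pair."""
    # A tab is a safe separator because normalize() collapses all
    # whitespace inside each side to single spaces.
    joined = normalize(src) + "\t" + normalize(tgt)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Keep the first occurrence of each pair, preserving input order."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = pair_key(src, tgt)
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique


# Toy corpus merged from several hypothetical sources.
corpus = [
    ("Good morning.", "शुभ प्रभात."),
    ("good  morning.", "शुभ प्रभात."),  # differs only in case and spacing
    ("Thank you.", "धन्यवाद."),
]

unique_pairs = deduplicate(corpus)
print(len(unique_pairs), "unique pairs out of", len(corpus))
```

Hashing the normalized pair keeps memory use proportional to the number of unique pairs rather than to their total text length, which matters at the scale of millions of sentence pairs. Near-duplicate detection (for example, pairs that differ by one token) would need a fuzzier criterion than the exact match used here.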
Problem

Research questions and friction points this paper is trying to address.

low-resource machine translation
parallel corpus
Marathi
data scarcity
morphologically rich languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-resource machine translation
parallel corpus
morphology-aware preprocessing
corpus deduplication
parameter-efficient fine-tuning
Param Thakkar
Student
Deep learning, Machine learning, Data structure, Algorithms, GPUs
Anushka Yadav
Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai
Michael Tiemann
Tübingen AI Center, University of Tübingen, Germany
Abhi Mehta
Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai
Akshita Bhasin
Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai
Shrinivas Khedkar
Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai