🤖 AI Summary
This study addresses the scarcity of high-quality English–Marathi parallel corpora for low-resource neural machine translation by constructing a dataset of 2.78 million sentence pairs spanning news, politics, healthcare, literature, and culture. The pipeline combines data collection from heterogeneous sources, stemmed and lemmatized representations, corpus-level deduplication, and parameter-efficient fine-tuning of the NLLB-200-distilled-600M model with LoRA. The ablation shows that corpus-level deduplication is the single largest preprocessing contributor to downstream quality: removing it lowers performance by 1.17 BLEU and 2.21 chrF++. The work presents a linguistically enriched English–Marathi parallel corpus and shows that rigorous data curation is a low-cost, high-impact intervention for low-resource, morphologically rich languages.
📝 Abstract
We present BhashaSetu, a linguistically enriched English–Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.
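To make the idea of corpus-level deduplication concrete, the following is a minimal sketch of cross-source duplicate removal over sentence pairs. The abstract does not state the matching criteria that BhashaSetu uses, so everything here is an assumption made for illustration: the function names `normalize` and `dedup_pairs` are hypothetical, duplicates are detected by exact match after Unicode NFC normalization and whitespace collapsing, the English side is compared case-insensitively, and the three sentence pairs are toy data rather than samples from the dataset.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize the text and collapse whitespace runs to single spaces."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedup_pairs(pairs):
    """Keep the first occurrence of each (English, Marathi) pair.

    Assumption: two pairs count as duplicates when both sides match after
    normalization, with the English side compared case-insensitively.
    """
    seen = set()
    unique = []
    for en, mr in pairs:
        key = (normalize(en).casefold(), normalize(mr))
        if key in seen:
            continue  # already seen, possibly from another source
        seen.add(key)
        unique.append((en, mr))  # preserve the original surface form
    return unique


# Toy pairs standing in for sentences merged from several sources.
corpus = [
    ("Good morning.", "शुभ प्रभात."),
    ("good  morning.", "शुभ प्रभात."),  # differs only in case and spacing
    ("Thank you.", "धन्यवाद."),
]
deduped = dedup_pairs(corpus)
```

Running the deduplication over the merged corpus, rather than within each source separately, is what lets it catch sentences that several sources share. A real pipeline might also need near-duplicate detection, which this sketch does not attempt.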