Advancing Bangla Machine Translation Through Informal Datasets

📅 2025-12-15

📈 Citations: 0

✨ Influential: 0

career value

199K/year

🤖 AI Summary

Informal language has been systematically neglected in Bangla machine translation (MT), hindering digital information access for 234 million native speakers. Method: This work introduces the first high-quality, open-source informal Bangla–English parallel corpus (180K sentence pairs), curated from social media and conversational texts. Leveraging the Transformer architecture, we integrate rigorous data cleaning, noise-robust training, and domain adaptation techniques, fine-tuning models on both Hugging Face and OpenNMT frameworks. Contribution/Results: Our approach achieves a +12.4 BLEU improvement over strong baselines on an informal test set, substantially enhancing translation fidelity for colloquial usage. This study bridges a critical gap in low-resource MT research—specifically, informal-domain translation—and advances the deployment of MT systems aligned with authentic user linguistic behavior. The corpus and models have already been adopted in multiple community-driven translation initiatives.

Technology Category

Application Category

📝 Abstract

Bangla is the sixth most widely spoken language globally, with approximately 234 million native speakers. However, progress in open-source Bangla machine translation remains limited. Most online resources are in English and often remain untranslated into Bangla, excluding millions from accessing essential information. Existing research in Bangla translation primarily focuses on formal language, neglecting the more commonly used informal language. This is largely due to the lack of pairwise Bangla-English data and advanced translation models. If datasets and models can be enhanced to better handle natural, informal Bangla, millions of people will benefit from improved online information access. In this research, we explore current state-of-the-art models and propose improvements to Bangla translation by developing a dataset from informal sources like social media and conversational texts. This work aims to advance Bangla machine translation by focusing on informal language translation and improving accessibility for Bangla speakers in the digital world.

Problem

Research questions and friction points this paper is trying to address.

Advancing Bangla machine translation using informal datasets

Addressing limited open-source resources for Bangla translation

Improving accessibility for Bangla speakers through informal language focus

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed dataset from social media and conversational texts

Focused on informal language translation for Bangla

Improved translation models to enhance online information access

🔎 Similar Papers

No similar papers found.