Advancing Bangla Machine Translation Through Informal Datasets

📅 2025-12-15
🤖 AI Summary
Problem: Informal language has been systematically neglected in Bangla machine translation (MT), which hinders digital information access for 234 million native speakers.
Method: This work introduces the first high-quality, open-source informal Bangla–English parallel corpus (180K sentence pairs), curated from social media and conversational texts. Transformer models are fine-tuned in both the Hugging Face and OpenNMT frameworks, with rigorous data cleaning, noise-robust training, and domain adaptation.
Contribution/Results: The approach achieves a +12.4 BLEU improvement over strong baselines on an informal test set, substantially improving translation fidelity for colloquial usage. The study addresses a critical gap in low-resource MT research, namely informal-domain translation, and supports the deployment of MT systems that match how users actually write. The corpus and models have already been adopted in multiple community-driven translation initiatives.
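
For readers unfamiliar with the metric behind a figure such as "+12.4 BLEU", the following is a minimal sketch of corpus-level BLEU (modified n-gram precision up to 4-grams, with a brevity penalty and no smoothing). It is an illustration only: the function names, whitespace tokenization, and single-reference setup are assumptions, not the evaluation script used in the paper. In practice an established tool such as sacreBLEU is usually preferred so that scores are comparable across papers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption: whitespace tokenization and no smoothing, so any n-gram
    order with zero matches yields a score of 0.0.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
            totals[n - 1] += sum(h_ng.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)


reference = ["i am not going to the market today"]
baseline = ["i am not going to market"]
improved = ["i am not going to the market now"]
gain = corpus_bleu(improved, reference) - corpus_bleu(baseline, reference)
print(f"BLEU gain: {gain:+.1f}")
```

A reported improvement is simply the difference between two such scores computed on the same test set, which is why the choice of test set (here, an informal one) matters as much as the number itself.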

📝 Abstract
Bangla is the sixth most widely spoken language globally, with approximately 234 million native speakers. However, progress in open-source Bangla machine translation remains limited. Most online resources are in English and are often never translated into Bangla, which excludes millions of people from essential information. Existing research on Bangla translation focuses primarily on formal language and neglects the more commonly used informal language, largely because parallel (sentence-paired) Bangla-English data and advanced translation models are lacking. If datasets and models can be improved to handle natural, informal Bangla, millions of people will benefit from better access to online information. In this research, we explore current state-of-the-art models and propose improvements to Bangla translation by developing a dataset from informal sources such as social media and conversational texts. This work aims to advance Bangla machine translation by focusing on informal language and improving accessibility for Bangla speakers in the digital world.

Problem

Research questions and friction points this paper is trying to address.

Advancing Bangla machine translation using informal datasets
Addressing limited open-source resources for Bangla translation
Improving accessibility for Bangla speakers through informal language focus

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed dataset from social media and conversational texts
Focused on informal language translation for Bangla
Improved translation models to enhance online information access
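
As an illustration of what building a parallel dataset from social media and conversational text can involve, here is a minimal cleaning sketch. Everything in it is an assumption made for illustration: the function names, the specific filters (URL and mention stripping, length and length-ratio limits, a Bangla-script check, deduplication), and the threshold values are hypothetical and are not taken from the paper, whose actual pipeline is not described on this page.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\S+")
BANGLA_RE = re.compile(r"[\u0980-\u09FF]")  # Bengali Unicode block


def normalize(text):
    """Apply NFC normalization, drop URLs and @mentions, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def clean_pairs(pairs, min_tokens=1, max_tokens=100, max_ratio=3.0):
    """Filter noisy (bangla, english) pairs; thresholds are illustrative guesses."""
    seen = set()
    kept = []
    for bn, en in pairs:
        bn, en = normalize(bn), normalize(en)
        bn_len, en_len = len(bn.split()), len(en.split())
        # Drop pairs where either side is empty or too long.
        if not (min_tokens <= bn_len <= max_tokens):
            continue
        if not (min_tokens <= en_len <= max_tokens):
            continue
        # Drop pairs whose lengths differ too much to be translations.
        if max(bn_len, en_len) / max(1, min(bn_len, en_len)) > max_ratio:
            continue
        # The source side must contain at least one Bangla character.
        if not BANGLA_RE.search(bn):
            continue
        # Deduplicate, ignoring case on the English side.
        key = (bn, en.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((bn, en))
    return kept


raw = [
    ("কি খবর ভাই? https://example.com", "What's up bro?"),
    ("কি খবর ভাই?", "what's up bro?"),
    ("hello there", "hello there"),
    ("", "empty source"),
]
print(len(clean_pairs(raw)))
```

Informal text often mixes scripts, emoji, and romanized Bangla, so a real pipeline would likely need looser or additional rules than the script check used here.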
Ayon Roy
University of Texas at Arlington
Artificial Intelligence, Large Language Model (LLM), Generative AI
Risat Rahaman
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Sadat Shibly
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Udoy Saha Joy
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Abdulla Al Kafi
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Farig Yousuf Sadeque
Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh