Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, a Low-Resource Language

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Unpunctuated Bangla text, such as raw ASR output, suffers from poor readability and degraded downstream performance, a common problem for this low-resource language. To address it, we propose the first large-scale, multi-domain punctuation restoration method specifically for Bangla. We construct a high-quality, manually annotated corpus spanning news, literary, and conversational domains, and introduce a data augmentation strategy combining back-translation and controlled noise injection. Our model employs XLM-RoBERTa-large as the encoder within a sequence-labeling framework, enhanced with a Conditional Random Field (CRF) decoder, to jointly restore periods, commas, question marks, and exclamation marks in an end-to-end manner. It achieves 97.1% accuracy on the news test set and maintains 90.2% accuracy on real ASR output, substantially outperforming all baselines. This work establishes the first open-source Bangla punctuation restoration benchmark, including a curated dataset, effective augmentation techniques, and a robust, generalizable model, filling a critical gap in punctuation recovery research for low-resource languages.
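
The summary describes a token-level sequence labeler: an XLM-RoBERTa-large encoder emitting per-token scores that a CRF layer decodes into punctuation labels. Below is a minimal sketch of that kind of architecture, assuming the Hugging Face transformers and pytorch-crf packages; the label names and wiring are illustrative, not the authors' released code.

```python
# Minimal sketch of an encoder + CRF punctuation tagger, as described in
# the summary. Assumes `transformers` and `pytorch-crf`; label set and
# wiring are illustrative, not the paper's released implementation.
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

# One label per token: the punctuation mark to insert after it, if any.
LABELS = ["O", "PERIOD", "COMMA", "QUESTION", "EXCLAMATION"]

class PunctuationTagger(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, len(LABELS))
        self.crf = CRF(len(LABELS), batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # Training: negative log-likelihood of the gold labels under the CRF.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Inference: Viterbi-decode the best label sequence per sentence.
        return self.crf.decode(emissions, mask=mask)
```

At inference time, decode returns one label per kept token, so restoring punctuation amounts to inserting the predicted mark after each word.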

📝 Abstract
Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks (period, comma, question mark, and exclamation mark) across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of α = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. Results show strong generalization to reference and ASR transcripts, demonstrating the model's effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
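
The abstract reports an augmentation factor α but does not spell out the procedure here. A common recipe in punctuation-restoration work is token-level noise injection, where each token is independently dropped or masked with probability α so the model learns to tolerate ASR-style errors; the sketch below is a hypothetical illustration of that recipe, not the paper's exact method.

```python
# Hypothetical token-level noise injection for data augmentation: each
# token is dropped or masked with probability alpha. The paper's exact
# augmentation procedure may differ; alpha, the 50/50 deletion/substitution
# split, and the <unk> placeholder are assumptions for illustration.
import random

def augment(tokens, labels, alpha=0.20, unk_token="<unk>", seed=None):
    """Return a noised copy of a (tokens, per-token punctuation labels) pair."""
    rng = random.Random(seed)
    out_tokens, out_labels = [], []
    for tok, lab in zip(tokens, labels):
        if rng.random() < alpha:
            if rng.random() < 0.5:
                continue         # deletion: drop the token and its label
            tok = unk_token      # substitution: mask the token, keep the label
        out_tokens.append(tok)
        out_labels.append(lab)
    return out_tokens, out_labels
```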
Problem

Research questions and friction points this paper is trying to address.

Punctuation restoration in low-resource Bangla text
Transformer models for automatic punctuation prediction
Addressing data scarcity with augmented training corpus
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer model XLM-RoBERTa-large for punctuation restoration
Data augmentation to address low-resource language constraints
Public datasets and code for Bangla NLP research
Md Obyedullahil Mamun
Bangladesh Army International University of Science and Technology (BAIUST), Cumilla, Bangladesh
Md Adyelullahil Mamun
Brac University
Arif Ahmad
North East University Bangladesh (NEUB), Sylhet, Bangladesh
Md. Imran Hossain Emu
Bangladesh Army International University of Science and Technology (BAIUST), Cumilla, Bangladesh