🤖 AI Summary
Bangla, a low-resource language, suffers from poor readability and degraded downstream performance (e.g., in ASR post-processing) because punctuation is frequently omitted. To address this, we propose the first large-scale, multi-domain punctuation restoration method specifically for Bangla. We construct a high-quality, manually annotated corpus spanning news, literary, and conversational domains, and introduce a data augmentation strategy combining back-translation with controlled noise injection. Our model uses XLM-RoBERTa-large as the encoder in a sequence-labeling framework with a Conditional Random Field (CRF) decoder, jointly restoring periods, commas, question marks, and exclamation marks end to end. It achieves 97.1% accuracy on the News test set and maintains 90.2% accuracy on real ASR output, substantially outperforming all baselines. This work establishes the first open-source Bangla punctuation restoration benchmark, comprising a curated dataset, effective augmentation techniques, and a robust, generalizable model, filling a critical gap in punctuation recovery research for low-resource languages.
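The sequence-labeling framing described above can be sketched as follows: each token receives one label, and punctuation is reattached after labeled tokens. The label names and the mapping of "period" to the Bangla danda (।) are our illustrative assumptions, not the paper's exact tag set; the encoder/CRF prediction step is omitted here.

```python
# Minimal sketch of a per-token punctuation-restoration labeling scheme.
# Label names and the danda mapping are assumptions for illustration only.
LABEL_TO_MARK = {
    "O": "",             # no punctuation after this token
    "PERIOD": "\u0964",  # Bangla danda '।', assumed to act as the period
    "COMMA": ",",
    "QUESTION": "?",
    "EXCLAIM": "!",
}

def restore_punctuation(tokens, labels):
    """Reattach punctuation to tokens according to predicted labels."""
    assert len(tokens) == len(labels)
    return " ".join(tok + LABEL_TO_MARK[lab] for tok, lab in zip(tokens, labels))

# Example with transliterated placeholder tokens:
print(restore_punctuation(["tumi", "kemon", "acho"], ["O", "O", "QUESTION"]))
# tumi kemon acho?
```

In the full system, the labels would come from the XLM-RoBERTa-large encoder followed by CRF decoding over the label sequence; this function only shows the final reinsertion step.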
📝 Abstract
Punctuation restoration improves text readability and is critical for post-processing Automatic Speech Recognition (ASR) output, especially for low-resource languages such as Bangla. In this study, we apply transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We predict four punctuation marks (period, comma, question mark, and exclamation mark) across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves 97.1% accuracy on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set.
Results show strong generalization to reference and ASR transcripts, demonstrating the model's effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
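The abstract does not spell out the controlled noise injection used for augmentation; one plausible component, sketched below under our own assumptions, is to drop trailing punctuation from training tokens with probability alpha, simulating the ASR-style punctuation loss the model must recover from. The punctuation set and dropout behavior here are illustrative, not the paper's exact pipeline.

```python
import random

# Punctuation marks targeted by the task: danda, comma, question, exclamation.
PUNCT = {"\u0964", ",", "?", "!"}

def inject_noise(tokens, alpha=0.20, seed=0):
    """Illustrative controlled noise injection: with probability alpha,
    strip a trailing punctuation mark from a token to mimic ASR output."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    noisy = []
    for tok in tokens:
        if tok and tok[-1] in PUNCT and rng.random() < alpha:
            noisy.append(tok[:-1])  # drop the mark; the label is kept elsewhere
        else:
            noisy.append(tok)
    return noisy
```

Pairing each noisy token sequence with the original punctuation labels yields extra (input, label) training examples; the augmentation factor then controls how much of this synthetic data is mixed into the corpus.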