PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

📅 2026-03-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the critical yet underexplored challenge of punctuation restoration in Persian automatic speech recognition (ASR) outputs, which commonly lack punctuation and thereby suffer from reduced readability and impaired downstream task performance. To bridge this gap, the authors construct the first large-scale, high-quality Persian punctuation restoration dataset comprising 17 million samples. They formulate the task as a token-level sequence labeling problem and achieve efficient and accurate punctuation recovery by fine-tuning a lightweight ParsBERT model. The proposed approach attains a macro-averaged F1 score of 91.33% on the test set, demonstrating a strong balance between accuracy and real-time inference capability while effectively mitigating the over-editing issues commonly associated with large language models. The code, trained models, and dataset are publicly released to foster further research in this domain.
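The summary describes the task formulation as token-level sequence labeling: each word receives a label naming the punctuation mark (if any) that follows it, and a token classifier such as fine-tuned ParsBERT predicts those labels. A minimal, self-contained sketch of this formulation (not the authors' code; the label set and helper names are illustrative, and Persian would use marks such as «،» and «؟»):

```python
# Sketch: punctuation restoration as token-level sequence labeling.
# Each word is tagged with the punctuation mark that follows it
# ("O" = no punctuation), so a BERT-style model can be fine-tuned
# as a token classifier over these labels.

PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}  # illustrative label set

def text_to_examples(text):
    """Strip trailing punctuation from each word, emitting (words, labels)."""
    words, labels = [], []
    for token in text.split():
        label = "O"
        while token and token[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[token[-1]]
            token = token[:-1]
        if token:
            words.append(token)
            labels.append(label)
    return words, labels

def restore(words, labels):
    """Inverse mapping: reattach predicted punctuation after each word."""
    inverse = {v: k for k, v in PUNCT_LABELS.items()}
    return " ".join(w + inverse.get(lab, "") for w, lab in zip(words, labels))

words, labels = text_to_examples("hello, how are you?")
print(words)   # ['hello', 'how', 'are', 'you']
print(labels)  # ['COMMA', 'O', 'O', 'QUESTION']
print(restore(words, labels))  # hello, how are you?
```

Training data for the labeling task can thus be generated automatically from any punctuated corpus by stripping the marks, which is how large datasets like the 17M-sample PersianPunc can be built without manual annotation.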

📝 Abstract
Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (https://huggingface.co/datasets/MohammadJRanjbar/persian-punctuation-restoration) and model (https://huggingface.co/MohammadJRanjbar/parsbert-persian-punctuation) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
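The reported metric is a macro-averaged F1 of 91.33%: per-class F1 scores averaged with equal weight, so rare punctuation marks count as much as frequent ones. A self-contained sketch of the computation (illustrative toy labels, not the paper's data):

```python
# Sketch: macro-averaged F1 over punctuation labels. Each class gets
# its own F1 (harmonic mean of precision and recall); the macro average
# weights all classes equally, unlike a micro or weighted average.

def macro_f1(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# One wrong COMMA prediction out of five tokens:
y_true = ["O", "COMMA", "O", "PERIOD", "O"]
y_pred = ["O", "COMMA", "COMMA", "PERIOD", "O"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.8222
```

Note how a single error drags the macro score down sharply here: the COMMA class F1 falls to 0.667 and the O class to 0.8, even though overall token accuracy is 80%.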
Problem

Research questions and friction points this paper addresses.

punctuation restoration, Persian, automatic speech recognition, low-resource languages, sequence labeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

punctuation restoration, Persian NLP, ParsBERT, sequence labeling, low-resource languages