Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of distinguishing “repetition disfluencies” (unintentional speech errors) from “morphological reduplication” (intentional, semantically motivated word formation) in Bangla automatic speech recognition (ASR) transcripts. To overcome the scarcity of annotated data, we introduce the first publicly available, linguistically fine-grained corpus of 20,000 Bangla utterances exhibiting repetition phenomena and define a linguistics-informed benchmark task. Methodologically, we integrate linguistic constraints with deep learning: (i) a few-shot prompting strategy leveraging multilingual large language models, and (ii) task-specific fine-tuning of BanglaBERT. Experiments show that the fine-tuned model achieves 84.78% accuracy and an F1 score of 0.677—substantially outperforming few-shot LLM baselines—and establishes the first strong baseline for this task. Our core contribution lies in the first systematic linguistic characterization and computational modeling of the semantic distinctions between these two repetition types, enabling high-fidelity ASR post-processing.
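The few-shot strategy described above can be sketched as a prompt-construction step: labeled demonstrations are concatenated ahead of the query utterance, and the LLM completes the final label. The label names, instruction wording, and Bangla example sentences below are illustrative assumptions, not the paper's actual prompt.

```python
# Hedged sketch of a few-shot classification prompt for a multilingual LLM.
# Labels, instruction text, and demonstration utterances are illustrative.

LABELS = ("disfluency", "reduplication")

def build_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt: instruction, labeled demos, then the query."""
    lines = [
        "Classify each Bangla word-word repetition as 'disfluency' "
        "(unintentional, should be removed) or 'reduplication' "
        "(grammatical, must be kept).",
        "",
    ]
    for text, label in examples:
        assert label in LABELS  # guard against typos in demo labels
        lines.append(f"Utterance: {text}\nLabel: {label}")
    lines.append(f"Utterance: {query}\nLabel:")  # model completes this line
    return "\n".join(lines)

demo = [
    ("আমি আমি যাব", "disfluency"),         # hesitation repeat (illustrative)
    ("ধীরে ধীরে হাঁটো", "reduplication"),   # 'slowly' — valid reduplication
]
prompt = build_prompt(demo, "সে সে এসেছে")
print(prompt)
```

The returned string would then be sent to the LLM's completion endpoint; only the single predicted label token needs to be parsed back out.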

📝 Abstract
Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.
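The gap between the reported accuracy (84.78%) and F1 (0.677) is typical of an imbalanced binary task, where the minority class (here, presumably reduplication) dominates the F1 score while the majority class inflates accuracy. The confusion counts below are invented purely to illustrate how such a gap arises; they are not from the paper.

```python
# Minimal sketch: how accuracy and F1 diverge on an imbalanced binary task.
# All counts are hypothetical, chosen only to illustrate the metric gap.

def binary_f1(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts on 1,000 utterances, with the minority
# class (reduplication) as the positive label:
tp, fp, fn, tn = 120, 60, 55, 765
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = binary_f1(tp, fp, fn)
print(round(accuracy, 4), round(f1, 4))
```

With these toy counts, accuracy lands near 0.89 while F1 stays around 0.68: most of the accuracy comes from the easy majority class, which F1 on the positive class deliberately ignores.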
Problem

Research questions and friction points this paper is trying to address.

Distinguishing repetition disfluency from morphological reduplication in Bangla ASR
Solving ambiguity in low-resource language speech recognition transcripts
Providing annotated corpus and benchmarks for Bangla text normalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created first annotated Bangla corpus distinguishing disfluency from reduplication
Benchmarked multilingual LLMs and fine-tuned encoder models
Achieved best results with fine-tuned BanglaBERT model
Zaara Zabeen Arpa
Department of Computer Science and Engineering, Islamic University of Technology, Board Bazar, Gazipur, 1704, Dhaka, Bangladesh.
Sadnam Sakib Apurbo
Department of Computer Science and Engineering, Islamic University of Technology, Board Bazar, Gazipur, 1704, Dhaka, Bangladesh.
Nazia Karim Khan Oishee
Department of Computer Science and Engineering, Islamic University of Technology, Board Bazar, Gazipur, 1704, Dhaka, Bangladesh.
Ajwad Abrar
Junior Lecturer, IUT
Natural Language Processing · Human Computer Interaction · Software Engineering