🤖 AI Summary
Bangla handwritten text recognition (HTR) faces dual challenges: high script complexity—including conjuncts, diacritics, and diverse handwriting variants—and severe scarcity of annotated data. To address these, we propose an efficient Bangla-specific HTR framework that replaces conventional subword tokenizers (e.g., BPE or WordPiece) with a grapheme-level tokenizer, employs a decoder-only Transformer architecture, and incorporates phoneme-aware tokenization to better capture phoneme–grapheme correspondences. The model is pretrained on large-scale synthetic data and fine-tuned on real handwritten samples. Evaluated on BanglaLekha-Isolated, BMNIST, and our newly curated BHTR benchmark, it achieves state-of-the-art performance—reducing average character error rate by 12.6% and accelerating inference by 3.2× compared to subword-based baselines. Our approach thus delivers both superior accuracy and significantly improved computational efficiency.
📝 Abstract
Despite Bengali being the sixth most spoken language in the world, handwritten text recognition (HTR) systems for Bengali remain severely underdeveloped. The complexity of Bengali script--featuring conjuncts, diacritics, and highly variable handwriting styles--combined with a scarcity of annotated datasets makes this task particularly challenging. We present GraDeT-HTR, a resource-efficient Bengali handwritten text recognition system based on a Grapheme-aware Decoder-only Transformer architecture. To address the unique challenges of Bengali script, we augment the performance of a decoder-only transformer by integrating a grapheme-based tokenizer and demonstrate that it significantly improves recognition accuracy compared to conventional subword tokenizers. Our model is pretrained on large-scale synthetic data and fine-tuned on real human-annotated samples, achieving state-of-the-art performance on multiple benchmark datasets.