🤖 AI Summary
Pashto OCR is a low-resource, high-complexity task: the script is cursive with connected glyphs, and annotated data is severely scarce. To address this, the work introduces PsOCR, a large-scale, multi-granularity synthetic benchmark dataset for Pashto. PsOCR comprises one million samples spanning 1,000 unique font families and varied colors, image sizes, and layouts, with hierarchical annotations at the word, line, and document levels, making it suitable for training and evaluating both CNN- and Transformer-based architectures. On a 10K-image benchmark subset, 11 state-of-the-art LMMs (seven open-source, four closed-source) are systematically evaluated; Gemini performs best overall, and Qwen-7B is the strongest open-source model. PsOCR is publicly released, helping to bridge critical gaps in both data resources and standardized evaluation for Pashto and laying a foundation for related scripts such as Arabic, Persian, and Urdu.
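Benchmarking LMMs on OCR means scoring model transcriptions against ground-truth text. The summary does not state which metric the paper uses; a common choice for OCR benchmarks is character error rate (CER), i.e. edit distance divided by reference length. A minimal self-contained sketch, assuming CER as the metric:

```python
# Hypothetical sketch of OCR scoring via character error rate (CER).
# CER = Levenshtein distance(prediction, reference) / len(reference).
# The paper's exact scoring protocol may differ; this is illustrative only.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character error rate; lower is better, 0.0 is a perfect match."""
    return edit_distance(prediction, reference) / max(len(reference), 1)

# Example: a prediction that drops the final letter of the Pashto reference.
reference = "پښتو ژبه"   # 8 characters
prediction = "پښتو ژب"   # 7 characters, one deletion
print(round(cer(prediction, reference), 3))  # → 0.125
```

Edit distance operates on Unicode code points, so it applies directly to Arabic-script text without any script-specific handling.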
📝 Abstract
This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.
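The abstract describes annotations with bounding boxes at word, line, and document levels. Line- and document-level boxes can be derived from word-level boxes by taking enclosing rectangles; the sketch below illustrates that hierarchy. The field names and `(x0, y0, x1, y1)` box convention are assumptions for illustration, not the released dataset's actual schema:

```python
# Hypothetical sketch of multi-granularity OCR annotations in the spirit of
# PsOCR: word-level boxes are given, and line-/document-level boxes are the
# enclosing rectangles. Field names and box format are illustrative only.

def enclosing_box(boxes):
    """Smallest (x0, y0, x1, y1) rectangle covering all input boxes."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)

def annotate(lines):
    """lines: list of lines, each a list of (word_text, box) pairs.
    Returns a nested document -> lines -> words annotation record."""
    line_anns = []
    for words in lines:
        line_anns.append({
            "text": " ".join(w for w, _ in words),
            "box": enclosing_box([b for _, b in words]),
            "words": [{"text": w, "box": b} for w, b in words],
        })
    return {
        "box": enclosing_box([ln["box"] for ln in line_anns]),
        "lines": line_anns,
    }

# Two lines of Pashto words with made-up pixel coordinates.
doc = annotate([
    [("پښتو", (10, 10, 60, 30)), ("ژبه", (70, 10, 110, 30))],
    [("سلام", (10, 40, 70, 60))],
])
print(doc["box"])  # → (10, 10, 110, 60), enclosing every word box
```

A nested record like this lets one dataset serve detection-style training (boxes at any granularity) as well as plain text-recognition training (concatenated line or document text).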