🤖 AI Summary
Pashto OCR is a low-resource, high-complexity task: the script is cursive with connected glyphs, and annotated data is severely scarce. To address this, the work introduces PsOCR, a large-scale, multi-granularity synthetic benchmark dataset for Pashto. PsOCR comprises one million samples spanning 1,000 unique font families and varied colors, image sizes, and layouts, with hierarchical annotations at the word, line, and document levels, making it suitable for training and evaluating both CNN- and Transformer-based architectures. On a 10K-image benchmark subset, 11 state-of-the-art LMMs (seven open-source, four closed-source) are systematically evaluated; Gemini performs best overall, and Qwen-7B is the strongest open-source model. PsOCR is publicly released, helping to bridge critical gaps in both data resources and standardized evaluation for Pashto and laying a foundation for related scripts such as Arabic, Persian, and Urdu.
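Benchmarking LMMs on OCR means scoring model transcriptions against ground-truth text. The summary does not state which metric the paper uses; a common choice for OCR benchmarks is character error rate (CER), i.e. edit distance divided by reference length. A minimal self-contained sketch, assuming CER as the metric:

```python
# Hypothetical sketch of OCR scoring via character error rate (CER).
# CER = Levenshtein distance(prediction, reference) / len(reference).
# The paper's exact scoring protocol may differ; this is illustrative only.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character error rate; lower is better, 0.0 is a perfect match."""
    return edit_distance(prediction, reference) / max(len(reference), 1)

# Example: a prediction that drops the final letter of the Pashto reference.
reference = "پښتو ژبه"   # 8 characters
prediction = "پښتو ژب"   # 7 characters, one deletion
print(round(cer(prediction, reference), 3))  # → 0.125
```

Edit distance operates on Unicode code points, so it applies directly to Arabic-script text without any script-specific handling.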
📝 Abstract
This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek's Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.
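The abstract describes annotations with bounding boxes at word, line, and document levels. Line- and document-level boxes can be derived from word-level boxes by taking enclosing rectangles; the sketch below illustrates that hierarchy. The field names and `(x0, y0, x1, y1)` box convention are assumptions for illustration, not the released dataset's actual schema:

```python
# Hypothetical sketch of multi-granularity OCR annotations in the spirit of
# PsOCR: word-level boxes are given, and line-/document-level boxes are the
# enclosing rectangles. Field names and box format are illustrative only.

def enclosing_box(boxes):
    """Smallest (x0, y0, x1, y1) rectangle covering all input boxes."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)

def annotate(lines):
    """lines: list of lines, each a list of (word_text, box) pairs.
    Returns a nested document -> lines -> words annotation record."""
    line_anns = []
    for words in lines:
        line_anns.append({
            "text": " ".join(w for w, _ in words),
            "box": enclosing_box([b for _, b in words]),
            "words": [{"text": w, "box": b} for w, b in words],
        })
    return {
        "box": enclosing_box([ln["box"] for ln in line_anns]),
        "lines": line_anns,
    }

# Two lines of Pashto words with made-up pixel coordinates.
doc = annotate([
    [("پښتو", (10, 10, 60, 30)), ("ژبه", (70, 10, 110, 30))],
    [("سلام", (10, 40, 70, 60))],
])
print(doc["box"])  # → (10, 10, 110, 60), enclosing every word box
```

A nested record like this lets one dataset serve detection-style training (boxes at any granularity) as well as plain text-recognition training (concatenated line or document text).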