🤖 AI Summary
This study addresses poor zero-shot OCR performance on two low-resource languages, Sinhala and Tamil. We conduct the first systematic evaluation of six mainstream OCR engines (Cloud Vision API, Surya, Document AI, Tesseract, Subasa OCR, EasyOCR; the last two on a single language each) on character- and word-level zero-shot tasks. We introduce the first synthetic Tamil OCR benchmark dataset and perform a cross-engine comparative analysis using five quantitative metrics. Results show that Surya achieves the best word error rate (2.61%) on Sinhala, while Document AI attains the lowest character error rate (0.78%) on Tamil. Our contributions are threefold: (1) the first cross-engine zero-shot OCR benchmark for Sinhala and Tamil; (2) an open-source, high-quality synthetic Tamil OCR test set; and (3) an empirical characterization of the performance boundaries and practical applicability of commercial versus open-source OCR engines for low-resource languages.
📝 Abstract
Optical Character Recognition (OCR) of printed text in Latin and its derivative scripts can now be considered a solved problem, owing to the volume of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, so that the strengths of each category can be evaluated. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five metrics that assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a Word Error Rate (WER) of 2.61%. Conversely, Document AI performed best across all metrics for Tamil, with a very low Character Error Rate (CER) of 0.78%. In addition to this analysis, we introduce a novel synthetic Tamil OCR benchmarking dataset.
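To make the two reported metrics concrete, the following is a minimal sketch of how CER and WER are commonly computed, assuming the standard Levenshtein-based definitions (edit distance divided by reference length). The abstract does not state the exact formulas or the other three metrics, so the function names `edit_distance`, `cer`, and `wer` are illustrative and not taken from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items yields an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate, counted over Unicode code points (an assumption)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("abcd", "abxd"))
    print(wer("the cat sat", "the cat sit"))
```

Note that a Python string is a sequence of code points, so a Sinhala or Tamil vowel sign or virama counts as its own character here. Whether the study counts code points or grapheme clusters is not stated, and that choice can change the CER value.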