Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

📅 2025-01-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the low OCR accuracy and the scarcity of annotated data for Sámi-language documents in the National Library of Norway's collection. It fine-tunes and compares three established OCR approaches, Transkribus, Tesseract, and TrOCR, and supplements manual annotations with machine-generated annotations and synthetically rendered text images. The results show that Transkribus and TrOCR outperform Tesseract on the in-domain library collection, whereas Tesseract performs best on an out-of-domain dataset. Fine-tuning pre-trained models on this combined training data yields accurate OCR for Sámi languages even with only a moderate amount of manually annotated data, which makes the approach relevant to other low-resource and endangered-language collections.

📝 Abstract

Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
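
The abstract does not say which metric is used to compare the three systems. As an illustration only, the sketch below computes character error rate (CER), a common way to score OCR output against a manual transcription. The function names, the choice of CER, and the sample strings are assumptions made for this example, not details taken from the paper.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hyp` into `ref` (classic dynamic-programming edit distance)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Both strings are NFC-normalised first (an assumption of this sketch) so that
    a precomposed letter and the same letter written as base character plus
    combining mark are counted as identical.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical strings: an OCR output that drops both diacritics.
print(cer("čáhci", "cahci"))
```

Normalising before comparison matters for text with accented letters, because otherwise a visually correct transcription can be penalised purely for its encoding.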

Problem

Research questions and friction points this paper is trying to address.

OCR accuracy

Sámi-language documents

Digitisation

Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR accuracy improvement

Automatic annotation

Computer-generated text images

🔎 Similar Papers

A Cross-Font Image Retrieval Network for Recognizing Undeciphered Oracle Bone Inscriptions