Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway

📅 2025-01-13
🤖 AI Summary
This study addresses the insufficient OCR accuracy for Sámi-language documents in the National Library of Norway's collection. It fine-tunes and compares three established OCR approaches, Transkribus, Tesseract, and TrOCR, and supplements manual annotations with machine-generated annotations and synthetically rendered text images. Transkribus and TrOCR outperform Tesseract on the in-domain library collection, whereas Tesseract performs best on an out-of-domain dataset. Fine-tuning pre-trained models on this combined training data yields accurate OCR for Sámi languages even with only a moderate amount of manually annotated data.

📝 Abstract
Optical Character Recognition (OCR) is crucial to the National Library of Norway's (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN's collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN's collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
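
The abstract compares OCR systems by transcription quality but does not name the metric used here. A common choice for this kind of comparison is the character error rate (CER): the edit distance between a reference transcription and the OCR output, divided by the reference length. The sketch below is an illustration under that assumption, not the authors' evaluation code; the function names `levenshtein` and `cer` and the sample strings are hypothetical. Both strings are normalised to Unicode NFC first, so that Sámi letters such as "á" or "č" count as one character whether they are stored precomposed or as a base letter plus a combining mark.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn `hyp` into `ref` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.
    Both strings are NFC-normalised before comparison."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: the OCR output drops the caron on the first letter.
reference = "čáppa giella"
ocr_output = "cáppa giella"
print(f"CER: {cer(reference, ocr_output):.3f}")
```

Averaging this score over the same set of reference lines for each system gives one comparable number per system. Note that CER can exceed 1.0 when the OCR output is much longer than the reference.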

Problem

Research questions and friction points this paper is trying to address.

OCR accuracy
Sámi-language documents
Digitisation

Innovation

Methods, ideas, or system contributions that make the work stand out.

OCR accuracy improvement
Automatic annotation
Computer-generated text images