Improving OCR for Historical Texts of Multiple Languages

📅 2025-08-14
🤖 AI Summary
This study addresses low OCR accuracy and the scarcity of annotated data in multilingual historical documents, covering ancient Hebrew texts, 16th- to 18th-century meeting resolutions, and modern English handwriting. To tackle these challenges, we propose a collaborative multi-model framework that integrates document layout awareness with confidence-driven pseudo-labeling. Our method combines Kraken, TrOCR, a custom CRNN (with a ResNet34-BiLSTM encoder), and DeepLabV3+ for semantic segmentation, with the CRNN optimized via CTC loss for sequence modeling. We introduce a dynamic-threshold pseudo-labeling mechanism and a weighted output fusion strategy to mitigate overfitting and noise propagation in low-resource settings. Evaluated across multiple cross-lingual and cross-era benchmark datasets, our approach achieves average character-level accuracy improvements of 4.2–9.7%. The framework is intended as a scalable, robust, end-to-end OCR solution for historical document digitization.
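
The summary reports results as character-level accuracy. The exact metric definition is not given here, so the following is only a minimal sketch under one common assumption: accuracy is taken as 1 minus the character error rate (CER), where CER is the total Levenshtein edit distance divided by the total number of reference characters. The function names `edit_distance` and `char_accuracy` are illustrative and do not come from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def char_accuracy(refs, hyps) -> float:
    """Corpus-level 1 - CER (an assumed definition, not the paper's)."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference corpus is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 1.0 - total_edits / total_chars
```

Under this definition the value can drop below zero when a hypothesis is much longer than its reference, which is one reason some reports clip the score or quote CER directly.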

📝 Abstract
This paper presents our methodology and findings from three tasks spanning Optical Character Recognition (OCR) and Document Layout Analysis using deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enlarged our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. Second, for the task on 16th- to 18th-century meeting resolutions, we used a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudo-labeling to refine our model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to capture sequential dependencies. The report closes with insights from these tasks and potential directions for future research.
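
The confidence-based pseudo-labeling mentioned in the abstract can be illustrated with a small sketch. It assumes that the recognizer emits per-character log-probabilities for each unlabeled line and that a line is kept as a pseudo-label when the geometric mean of its character probabilities clears a fixed threshold. The `Prediction` type, the `line_confidence` score, and the 0.9 threshold are all hypothetical choices made for illustration; the paper's actual confidence measure and threshold are not stated here.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """One recognizer output for an unlabeled line image (hypothetical type)."""
    image_id: str
    text: str
    char_log_probs: List[float]  # log-probability of each predicted character


def line_confidence(pred: Prediction) -> float:
    """Geometric mean of the per-character probabilities, in [0, 1]."""
    if not pred.char_log_probs:
        return 0.0  # an empty prediction carries no evidence
    return math.exp(sum(pred.char_log_probs) / len(pred.char_log_probs))


def select_pseudo_labels(
    preds: List[Prediction], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Keep (image_id, text) pairs whose confidence is at least the threshold."""
    return [
        (p.image_id, p.text)
        for p in preds
        if line_confidence(p) >= threshold
    ]
```

In a typical self-training loop, the selected pairs would be added to the labeled set and the model retrained, with the threshold controlling the trade-off between the amount of extra data and the amount of label noise.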
Problem

Research questions and friction points this paper is trying to address.

Enhancing OCR accuracy for historical Hebrew texts
Improving document layout analysis for 16th-18th century resolutions
Recognizing modern English handwriting with deep learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced Hebrew OCR with Kraken and TrOCR
CRNN with DeepLabV3+ for historical documents
ResNet34 CRNN with CTC for handwriting
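
A CRNN trained with CTC outputs one score vector per time step, so a decoding step is needed to turn those frames into text. The sketch below shows standard greedy (best-path) CTC decoding: take the highest-scoring class in each frame, collapse consecutive repeats, then drop blanks. It assumes the blank class sits at index 0 and that class index i maps to `alphabet[i - 1]`; whether the authors used greedy decoding or a beam search is not stated, so this is an illustration of the general technique only.

```python
from typing import Sequence

BLANK = 0  # assumed index of the CTC blank class


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding over a (time, classes) score matrix.

    Class 0 is the blank; class i (i >= 1) corresponds to alphabet[i - 1].
    """
    # Pick the highest-scoring class index in every frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]

    decoded = []
    prev = BLANK
    for idx in best_path:
        # Emit a character only when it is not blank and not a repeat
        # of the class chosen in the immediately preceding frame.
        if idx != BLANK and idx != prev:
            decoded.append(alphabet[idx - 1])
        prev = idx
    return "".join(decoded)
```

With the alphabet "ab", a best path of 1, 1, 0, 1, 2, 2, 0 decodes to "aab": the blank between the two runs of class 1 is what allows a genuinely doubled letter to survive the collapse step.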
Hylke Westerdijk
University of Groningen
Ben Blankenborg
University of Groningen
Khondoker Ittehadul Islam
University of Groningen
Natural Language Processing