🤖 AI Summary
This study addresses the low OCR accuracy and scarcity of annotated data in multilingual historical documents, spanning ancient Hebrew texts, 16th- to 18th-century multilingual meeting resolutions, and contemporary English handwritten manuscripts. To tackle these challenges, we propose a collaborative multi-model framework that integrates document layout awareness with confidence-driven pseudo-labeling. Our method combines Kraken, TrOCR, a custom CRNN (ResNet34 encoder with a BiLSTM), and DeepLabV3+ for semantic segmentation, optimized with CTC loss for sequence modeling. We introduce a dynamic-threshold pseudo-labeling mechanism and a weighted output-fusion strategy to mitigate overfitting and noise propagation in low-resource settings. Evaluated on multiple cross-lingual, cross-era benchmark datasets, our approach achieves average character-level accuracy improvements of 4.2–9.7%. The framework delivers a scalable, robust, end-to-end OCR solution tailored to historical document digitization.
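The dynamic-threshold pseudo-labeling idea above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the function name `select_pseudo_labels`, the linear decay schedule, and all parameter values (`base_threshold`, `decay`, `floor`) are assumptions chosen for clarity.

```python
def select_pseudo_labels(predictions, confidences, epoch,
                         base_threshold=0.9, decay=0.02, floor=0.7):
    """Keep only model predictions whose confidence clears a threshold
    that relaxes as training progresses, so early pseudo-labels are
    strict (less noise) and later ones admit harder examples.

    All hyperparameter values here are illustrative assumptions."""
    # Linearly decay the confidence threshold per epoch, never below `floor`.
    threshold = max(floor, base_threshold - decay * epoch)
    selected = [(p, c) for p, c in zip(predictions, confidences)
                if c >= threshold]
    return selected, threshold

# Usage: at epoch 0 only the most confident prediction survives;
# by epoch 5 the relaxed threshold (0.8) admits a second one.
preds = ["shalom", "resolutie", "hello"]
confs = [0.95, 0.60, 0.85]
early, thr0 = select_pseudo_labels(preds, confs, epoch=0)
later, thr5 = select_pseudo_labels(preds, confs, epoch=5)
```

In this sketch the selected pairs would be fed back into training as additional (noisy) supervision, which is what the summary's "confidence-driven pseudo-labeling" refers to.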
📝 Abstract
This paper presents our methodology and findings from three tasks spanning Optical Character Recognition (OCR) and Document Layout Analysis, using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enlarged our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. Second, for the task of recognizing 16th- to 18th-century meeting resolutions, we utilized a Convolutional Recurrent Neural Network (CRNN) that combined DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudo-labeling to refine the model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained with the Connectionist Temporal Classification (CTC) loss to capture sequential dependencies effectively. This report offers valuable insights and suggests potential directions for future research.
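The CRNN-plus-CTC pipeline described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: a small convolutional stack stands in for the ResNet34 encoder, and all layer widths, input sizes, and the `CRNN` class name are illustrative assumptions. The structure it demonstrates is the real one, though: a visual encoder collapsed to a horizontal sequence, a bidirectional LSTM over that sequence, and per-timestep logits scored with `nn.CTCLoss`.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN sketch: conv encoder (a stand-in for ResNet34),
    bidirectional LSTM, per-timestep classifier. Sizes are illustrative."""
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halve height and width
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),      # collapse height -> 1D sequence
        )
        self.rnn = nn.LSTM(64, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # num_classes includes CTC blank

    def forward(self, x):                         # x: (B, 1, H, W)
        f = self.encoder(x)                       # (B, 64, 1, W//2)
        f = f.squeeze(2).permute(0, 2, 1)         # (B, T, 64) with T = W//2
        out, _ = self.rnn(f)                      # (B, T, 2*hidden)
        return self.fc(out)                       # (B, T, num_classes)

# One CTC training step on dummy data (blank symbol at index 0).
model = CRNN(num_classes=11)                      # 10 characters + blank
x = torch.randn(2, 1, 32, 128)                    # batch of 2 text-line crops
logits = model(x)                                 # (2, 64, 11)
log_probs = logits.log_softmax(2).permute(1, 0, 2)  # CTC expects (T, B, C)
targets = torch.randint(1, 11, (2, 5))            # two 5-character labels
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), logits.size(1)),
    target_lengths=torch.full((2,), 5),
)
```

CTC is what lets the network learn sequential alignments without character-level bounding boxes: the loss marginalizes over all valid alignments between the 64 output timesteps and the 5 target characters.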