🤖 AI Summary
This study addresses a critical methodological gap in automatic text recognition (ATR) of ancient scripts, whose outputs rarely balance palaeographic fidelity with modern readability. To bridge this divide, the paper introduces the “Pre-Editorial Normalization” (PEN) task, formally defined as transforming graphemic ATR transcriptions into standardized texts that follow editorial conventions, while retaining the palaeographically faithful transcription as an intermediate representation. The authors construct a silver-standard training set of 4.66 million samples and a gold-standard evaluation set of 1,800 expert-corrected instances, drawing on Old French and Latin texts from the CoMMA corpus aligned with digitized editions via passim. A ByT5-based sequence-to-sequence model trained on this data achieves a character error rate (CER) of 6.7% on the PEN task, substantially outperforming existing approaches and reconciling faithful ancient-script transcription with digital scholarly editing.
📝 Abstract
Recent advances in Automatic Text Recognition (ATR) have improved access to historical archives, yet a methodological divide persists between palaeographic transcriptions and normalized digital editions. ATR models trained on palaeographically oriented datasets such as CATMuS have shown greater generalizability, but their raw outputs remain difficult for most readers and poorly suited to downstream NLP tools, creating a usability gap. Conversely, ATR models trained to produce normalized outputs have been shown to struggle to adapt to new domains and tend to over-normalize and hallucinate. We introduce the task of Pre-Editorial Normalization (PEN): normalizing graphemic ATR output according to editorial conventions. PEN retains an intermediate step with palaeographic fidelity while providing a normalized version for practical usability. We present a new dataset derived from the CoMMA corpus and aligned with digitized Old French and Latin editions using passim, together with a manually corrected gold-standard evaluation set. We benchmark this resource with ByT5-based sequence-to-sequence models on normalization and pre-annotation tasks. Our contributions include the formal definition of PEN, a 4.66M-sample silver training corpus, a 1.8k-sample gold evaluation set, and a normalization model achieving a 6.7% CER, substantially outperforming previous models for this task.
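The headline metric, character error rate (CER), is the character-level edit distance between a model's output and the gold normalization, divided by the length of the gold reference. A minimal sketch of such a computation (the paper's exact evaluation tooling is not specified here, and the example strings are hypothetical):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

# Hypothetical example: a graphemic ATR line vs. a normalized editorial form.
print(cer("seignor", "seigneur"))  # → 0.25 (2 edits / 8 reference chars)
```

A CER of 6.7% thus means roughly one character-level edit per fifteen characters of the gold normalized text.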