🤖 AI Summary
Historical texts exhibit extreme orthographic variability due to non-standardized spelling conventions and diachronic language change, which severely impedes humanities research. To address this, the authors propose two spelling normalization methods based on large language models (LLMs): (1) an LLM trained without supervision, exploring the feasibility of unsupervised cross-temporal, multilingual spelling alignment, and (2) an LLM trained for machine translation, which maps historical spellings to their modern forms as a supervised translation task. Both automatically convert historical texts into modern standardized orthography. Experiments span multilingual, multi-period historical corpora. Results show that while both LLM-based approaches yield encouraging results, statistical machine translation still appears to be the most suitable technology for this task. The work delineates the applicability boundaries of these distinct technical pathways and offers a reusable methodology for historical text digitization and normalization.
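The translation-style pathway is, at its core, ordinary sequence-to-sequence fine-tuning on parallel (historical, modern) text. The paper does not specify a model or toolkit; the sketch below assumes the Hugging Face `transformers` library and a byte-level seq2seq model (ByT5), with invented toy sentence pairs, purely to make the mapping concrete.

```python
# Minimal sketch of the supervised, translation-style pathway: treat
# normalization as "translating" historical spelling into modern spelling.
# ByT5 and the toy parallel pairs below are illustrative assumptions,
# not the paper's actual setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

# Toy parallel corpus: (historical form, modern form).
pairs = [
    ("Vppon the morowe", "Upon the morrow"),
    ("he hath spoken", "he has spoken"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for historical, modern in pairs:
    inputs = tokenizer(historical, return_tensors="pt")
    labels = tokenizer(modern, return_tensors="pt").input_ids
    # Standard seq2seq cross-entropy loss against the modern reference.
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After fine-tuning, normalization is just generation.
model.eval()
ids = tokenizer("Vppon the morowe", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=64)[0],
                       skip_special_tokens=True))
```

A byte-level model is a natural (though here assumed) choice for this task, since spelling normalization mostly rewrites characters within words rather than reordering whole phrases.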
📝 Abstract
The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization aims to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one trained without supervision, and another trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to conclude that, while both approaches yield encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
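The abstract does not name its evaluation metrics; character error rate (CER) and word-level accuracy are the customary measures in spelling-normalization work, and the self-contained sketch below (with invented reference/hypothesis strings) shows how such a cross-system comparison is typically scored.

```python
# Hedged sketch of standard normalization metrics; the metric choice and
# the example strings are assumptions, not taken from the paper.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def word_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Fraction of tokens normalized exactly as in the reference."""
    tokens = [(r, h) for ref, hyp in zip(refs, hyps)
              for r, h in zip(ref.split(), hyp.split())]
    return sum(r == h for r, h in tokens) / max(len(tokens), 1)

refs = ["Upon the morrow"]
hyps = ["Upon the morow"]   # hypothetical system output with one error
print(f"CER: {cer(refs[0], hyps[0]):.3f}, "
      f"word accuracy: {word_accuracy(refs, hyps):.3f}")
```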