🤖 AI Summary
Historical texts exhibit extreme orthographic variability due to non-standardized spelling conventions and diachronic language change, which severely impedes humanities research. To address this, the authors propose two spelling normalization methods based on large language models (LLMs): (1) an LLM trained without supervision, exploring the feasibility of unsupervised cross-temporal, multilingual spelling alignment, and (2) an LLM trained for machine translation, which maps historical spellings to their modern forms as a supervised translation task. Both automatically convert historical texts into modern standardized orthography. Experiments span multilingual, multi-period historical corpora. Results show that while both LLM-based approaches yield encouraging results, statistical machine translation still appears to be the most suitable technology for this task. The work delineates the applicability boundaries of these distinct technical pathways and offers a reusable methodology for historical text digitization and normalization.
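The translation-style pathway is, at its core, ordinary sequence-to-sequence fine-tuning on parallel (historical, modern) text. The paper does not specify a model or toolkit; the sketch below assumes the Hugging Face `transformers` library and a byte-level seq2seq model (ByT5), with invented toy sentence pairs, purely to make the mapping concrete.

```python
# Minimal sketch of the supervised, translation-style pathway: treat
# normalization as "translating" historical spelling into modern spelling.
# ByT5 and the toy parallel pairs below are illustrative assumptions,
# not the paper's actual setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

# Toy parallel corpus: (historical form, modern form).
pairs = [
    ("Vppon the morowe", "Upon the morrow"),
    ("he hath spoken", "he has spoken"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for historical, modern in pairs:
    inputs = tokenizer(historical, return_tensors="pt")
    labels = tokenizer(modern, return_tensors="pt").input_ids
    # Standard seq2seq cross-entropy loss against the modern reference.
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After fine-tuning, normalization is just generation.
model.eval()
ids = tokenizer("Vppon the morowe", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=64)[0],
                       skip_special_tokens=True))
```

A byte-level model is a natural (though here assumed) choice for this task, since spelling normalization mostly rewrites characters within words rather than reordering whole phrases.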
📝 Abstract
The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization aims to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one trained without supervision, and another trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to conclude that, while both approaches yield encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
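The abstract does not name its evaluation metrics; character error rate (CER) and word-level accuracy are the customary measures in spelling-normalization work, and the self-contained sketch below (with invented reference/hypothesis strings) shows how such a cross-system comparison is typically scored.

```python
# Hedged sketch of standard normalization metrics; the metric choice and
# the example strings are assumptions, not taken from the paper.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def word_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Fraction of tokens normalized exactly as in the reference."""
    tokens = [(r, h) for ref, hyp in zip(refs, hyps)
              for r, h in zip(ref.split(), hyp.split())]
    return sum(r == h for r, h in tokens) / max(len(tokens), 1)

refs = ["Upon the morrow"]
hyps = ["Upon the morow"]   # hypothetical system output with one error
print(f"CER: {cer(refs[0], hyps[0]):.3f}, "
      f"word accuracy: {word_accuracy(refs, hyps):.3f}")
```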