Two Spelling Normalization Approaches Based on Large Language Models

📅 2025-06-29
🤖 AI Summary
Historical texts show extreme orthographic variability because spelling conventions were not standardized and language changed over time, which hampers humanities research. To address this, we propose two spelling normalization methods based on large language models (LLMs): (1) an LLM trained without supervision and (2) an LLM trained for machine translation. Both aim to convert historical text automatically into modern standardized orthography. The evaluation covers multiple datasets spanning diverse languages and historical periods. Both approaches yield encouraging results, but statistical machine translation still appears to be the most suitable technology for this task. The comparison helps clarify where each approach is applicable in the digitization and normalization of historical texts.

📝 Abstract
The absence of standardized spelling conventions and the organic evolution of human language pose an inherent linguistic challenge in historical documents, a longstanding concern for scholars in the humanities. To address this issue, spelling normalization aims to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one trained without supervision, and a second one trained for machine translation. Our evaluation spans multiple datasets covering diverse languages and historical periods. We conclude that, while both approaches yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.
Problem

Research questions and friction points this paper is trying to address.

Standardize spelling in historical documents
Compare unsupervised and machine translation models
Evaluate performance across diverse languages and periods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised large language model training
Machine translation-based normalization approach
Evaluation across diverse languages and periods
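
To make the task concrete, here is a minimal Python sketch of what normalizing a historical spelling and scoring the result could look like. It is not the authors' method: the lexicon lookup is a toy stand-in for a trained model, the example sentence and the names `TOY_LEXICON`, `normalize`, and `character_error_rate` are hypothetical, and character error rate is an assumed metric (this page does not say which metrics the paper reports).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance divided by reference length (assumed metric)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


def normalize(text: str, lexicon: dict) -> str:
    """Toy normalizer: replace each known historical form, keep the rest."""
    return " ".join(lexicon.get(token, token) for token in text.split())


# Hypothetical historical-to-modern pairs, for illustration only.
TOY_LEXICON = {"haue": "have", "loue": "love", "vertue": "virtue"}

historical = "i haue loue and vertue"
reference = "i have love and virtue"
normalized = normalize(historical, TOY_LEXICON)

cer_before = character_error_rate(historical, reference)
cer_after = character_error_rate(normalized, reference)
```

Comparing `cer_before` with `cer_after` shows how much a normalizer closes the gap to the modern reference. A learned system would replace `normalize` while the scoring step stays the same.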
Authors
Miguel Domingo
PRHLT Research Center, Universitat Politècnica de València, Spain; ValgrAI - Valencian Graduate School and Research Network for Artificial Intelligence, Spain
Francisco Casacuberta
Ad Honorem Professor, PRHLT, Polytechnic University of Valencia
Pattern Recognition, Machine Translation, Machine Learning, Multi-modal Interaction, Video and image