Historical German Text Normalization Using Type- and Token-Based Language Modeling

📅 2024-09-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe orthographic variation in German literary texts from c. 1700–1900, which impedes full-text search and NLP, this paper proposes a normalization system that combines type-level and token-level modeling: a Transformer encoder-decoder normalizes individual word types, while a pretrained causal language model adjusts these normalizations within their sentence context. Both components are trained on a parallel corpus of historical and modernized text. An extensive evaluation shows that the system achieves state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system built by fine-tuning a pretrained Transformer large language model. The authors note, however, that historical text normalization remains challenging due to limited model generalization and the scarcity of extensive high-quality parallel data.

📝 Abstract
Historic spelling variation poses a challenge for full-text search and natural language processing on historical digitized texts. To minimize the gap between historic orthography and contemporary spelling, an automatic orthographic normalization of the historical source material is usually pursued. This report proposes a normalization system for German literary texts from c. 1700–1900, trained on a parallel corpus. The proposed system uses a machine learning approach based on Transformer language models, combining an encoder-decoder model that normalizes individual word types with a pre-trained causal language model that adjusts these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system obtained by fine-tuning a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge due to models' difficulty generalizing and the lack of extensive high-quality parallel data.
Problem

Research questions and friction points this paper is trying to address.

Normalize historical German text spelling
Bridge historic and contemporary orthography
Improve NLP on historical digitized texts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer language models for normalization
Encoder-decoder model for word types
Pre-trained causal language model for context
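The two-stage design listed above can be sketched as follows. This is an illustrative toy, not the authors' code: the candidate table stands in for the n-best output of the type-level encoder-decoder, and the bigram scorer stands in for a pretrained causal language model; all words, scores, and function names are hypothetical.

```python
from math import log

# Stage 1 stand-in: type-level normalization. A real system would run a
# character-level Transformer encoder-decoder per word type; here a
# hypothetical candidate table mimics its (candidate, probability) output.
TYPE_CANDIDATES = {
    "theil": [("teil", 0.9), ("theil", 0.1)],
    "seyn": [("sein", 0.95), ("seyn", 0.05)],
    "und": [("und", 1.0)],
}

def type_normalize(word):
    """Return (candidate, probability) pairs for a historical word type."""
    return TYPE_CANDIDATES.get(word, [(word, 1.0)])

# Stage 2 stand-in: token-level adjustment. A real system would query a
# pretrained causal LM; this toy scorer simply rewards a few modern bigrams.
MODERN_BIGRAMS = {("<s>", "teil"), ("teil", "und"), ("und", "sein")}

def context_score(prev, cand):
    """Crude contextual log-score: 0 for a known bigram, a penalty otherwise."""
    return 0.0 if (prev, cand) in MODERN_BIGRAMS else -1.0

def normalize_sentence(tokens, lm_weight=1.0):
    """Greedy left-to-right decoding combining type and context scores."""
    output, prev = [], "<s>"
    for tok in tokens:
        best = max(
            type_normalize(tok),
            key=lambda c: log(c[1]) + lm_weight * context_score(prev, c[0]),
        )[0]
        output.append(best)
        prev = best
    return output

print(normalize_sentence(["theil", "und", "seyn"]))  # ['teil', 'und', 'sein']
```

The point of the split is that the type-level model needs to see each spelling variant only once, while the context model resolves ambiguous types (e.g. a historical form with several valid modern spellings) per token.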