Pretraining Language Models on Historical Text

📅 2026-06-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenges of modeling English historical texts predating 1913, namely data scarcity, temporal information leakage, and chronological inconsistency, by proposing a language modeling pipeline specifically designed for historical language. The authors construct TypewriterCorpus, a 54-billion-token corpus drawn from archival and linguistically annotated sources, and apply data cleaning and leakage mitigation procedures to preserve temporal integrity. They further introduce lexically grounded instruction tuning, a post-training framework that constrains model responses to remain directly grounded in historical source documents. The study releases TypewriterLM, a 7.24-billion-parameter language model, accompanied by two instruction datasets (History-LIMA and History-SelfInstruct) and a new evaluation benchmark, History-Event, which evaluates competence, temporal grounding, and data leakage. The resulting model is reported to maintain strong general language capabilities while improving temporal coherence and historical accuracy.
📝 Abstract
We introduce TypewriterLM, a 7.24B-parameter History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.
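The abstract does not spell out how leakage mitigation is performed, so the following is only a minimal sketch of one plausible ingredient: a cutoff filter that drops documents dated 1913 or later, or that mention a later year in their text. The record fields `year` and `text`, the function names, and the year-matching heuristic are all assumptions made for illustration, not the authors' pipeline.

```python
import re

CUTOFF_YEAR = 1913  # the abstract states the model is trained on text predating 1913

# Matches standalone four-digit years from 1000 to 2099 (a heuristic, assumed here).
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def mentions_later_year(text, cutoff=CUTOFF_YEAR):
    """Return True if the text mentions any year at or after the cutoff."""
    return any(int(year) >= cutoff for year in YEAR_PATTERN.findall(text))


def filter_documents(docs, cutoff=CUTOFF_YEAR):
    """Keep documents with a known pre-cutoff year and no later-year mentions.

    Each document is a dict with hypothetical keys "year" (int or None)
    and "text" (str).
    """
    kept = []
    for doc in docs:
        year = doc.get("year")
        # Drop undated documents and those published at or after the cutoff.
        if year is None or year >= cutoff:
            continue
        # Drop documents whose body refers to a later year, such as
        # modern editorial notes attached to an older scan.
        if mentions_later_year(doc["text"], cutoff):
            continue
        kept.append(doc)
    return kept
```

A filter like this is deliberately conservative: it would also discard genuine period text that happens to contain a number resembling a later year, so a real pipeline would likely pair it with finer-grained checks.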
Problem

Research questions and friction points this paper is trying to address.

historical language models
temporal leakage
data quality
temporal consistency
evaluation benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

historical language models
temporal leakage mitigation
lexically grounded instruction tuning
TypewriterCorpus
History-Event benchmark