Pretraining Language Models on Historical Text

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges in modeling English historical text predating 1913: data scarcity, temporal information leakage, and chronological inconsistency in post-training. The authors construct TypewriterCorpus, a 54-billion-token corpus drawn from archival and linguistically annotated sources, and apply data cleaning and leakage mitigation procedures to protect temporal integrity. They further introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. The study releases TypewriterLM, a 7.24-billion-parameter language model, together with two instruction-tuning datasets (History-LIMA and History-SelfInstruct) and a new benchmark suite, History-Event, which evaluates competence, temporal grounding, and data leakage. The resulting model is reported to retain general language capabilities while improving temporal coherence and historical accuracy.

📝 Abstract

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction-tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.
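
The abstract does not describe how leakage mitigation is implemented. As a rough illustration of the general idea only, the sketch below drops any document whose publication year is missing or falls at or after the 1913 cutoff, and any document whose text mentions a later four-digit year. The `Document` type, the function names, and the year-matching heuristic are all hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

CUTOFF_YEAR = 1913  # from the abstract: training text predates 1913

# Hypothetical heuristic: standalone four-digit numbers from 1000 to 2099
# are treated as year mentions.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


@dataclass
class Document:
    """Hypothetical record for one corpus document."""
    text: str
    publication_year: Optional[int]  # None when metadata is missing


def latest_year_mentioned(text: str) -> Optional[int]:
    """Return the largest year-like number in the text, or None if there is none."""
    years = [int(match) for match in YEAR_PATTERN.findall(text)]
    return max(years) if years else None


def is_temporally_safe(doc: Document, cutoff: int = CUTOFF_YEAR) -> bool:
    """Keep a document only if its metadata and its text both predate the cutoff."""
    # Conservative choice: undated documents are rejected rather than trusted.
    if doc.publication_year is None or doc.publication_year >= cutoff:
        return False
    latest = latest_year_mentioned(doc.text)
    return latest is None or latest < cutoff


def filter_corpus(docs: List[Document], cutoff: int = CUTOFF_YEAR) -> List[Document]:
    """Return only the documents that pass the temporal check."""
    return [doc for doc in docs if is_temporally_safe(doc, cutoff)]


if __name__ == "__main__":
    corpus = [
        Document("Printed in London, 1887.", 1887),
        Document("A reprint with a 1954 editor's preface.", 1890),
        Document("No date metadata.", None),
        Document("Later work.", 1920),
    ]
    for doc in filter_corpus(corpus):
        print(doc.publication_year, doc.text)
```

A year-mention check of this kind is deliberately crude: it catches modern prefaces and catalogue notes attached to old scans, but it also rejects legitimate text that happens to contain a large number such as a population count, so a real pipeline would likely need finer-grained rules.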

Problem

Research questions and friction points this paper is trying to address.

historical language models

temporal leakage

data quality

temporal consistency

evaluation benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

historical language models

temporal leakage mitigation

lexically grounded instruction tuning