DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

πŸ“… 2026-03-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the issue of lookahead bias in financial backtesting caused by future information inadvertently embedded in the pretraining data of large language models, which compromises predictive validity. To mitigate this, the study presents the first systematic construction of a temporally aware model family comprising twelve 1.3B-parameter models, each pretrained and instruction-tuned exclusively on historical data up to a specific year from 2013 to 2024, with rigorous temporal consistency enforced during data curation. Experiments demonstrate that each model’s knowledge is effectively confined to its designated cutoff year while maintaining benchmark performance comparable to similarly sized standard models. The project further releases an open-source web platform enabling interactive cross-year model comparison, establishing a reproducible, lookahead-free paradigm for financial time-series modeling.
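
In practice, the strict annual cutoffs described above reduce to filtering every candidate training document by its publication date before it enters a given year's corpus. Below is a minimal sketch of such a filter, assuming each document carries an ISO-8601 `published_at` timestamp; the field and function names are illustrative, not the paper's actual pipeline.

```python
from datetime import datetime

def within_cutoff(doc: dict, cutoff_year: int) -> bool:
    # Admit a document only if it was published on or before Dec 31 of the
    # cutoff year. The `published_at` ISO-8601 field is an assumption; the
    # paper does not specify its metadata schema.
    published = datetime.fromisoformat(doc["published_at"])
    return published.year <= cutoff_year

def build_corpus(docs, cutoff_year: int):
    # Yield only documents admissible for the cutoff-year model, so that,
    # e.g., the 2016 model never trains on text written in 2017 or later.
    for doc in docs:
        if within_cutoff(doc, cutoff_year):
            yield doc

# One corpus per annual cutoff, 2013 through 2024:
# corpora = {year: list(build_corpus(all_docs, year)) for year in range(2013, 2025)}
```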

πŸ“ Abstract
In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.
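
The perplexity-based probing mentioned in the abstract can be pictured as follows: a model whose training data ends in a given year should assign markedly higher perplexity to statements that only became knowable afterward. Here is a minimal sketch using Hugging Face `transformers`; the checkpoint name `datedgpt-2016` and the probe sentences are placeholders, not the paper's released artifacts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    # Per-token perplexity of `text` under a causal language model.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(out.loss).item()

# Hypothetical checkpoint name; the released models may be named differently.
tokenizer = AutoTokenizer.from_pretrained("datedgpt-2016")
model = AutoModelForCausalLM.from_pretrained("datedgpt-2016")

# A fact from after the 2016 cutoff should look far more "surprising"
# (higher perplexity) to the 2016 model than a pre-cutoff fact.
pre = "Barack Obama won the 2012 United States presidential election."
post = "Joe Biden won the 2020 United States presidential election."
print(perplexity(model, tokenizer, pre), perplexity(model, tokenizer, post))
```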
Problem

Research questions and friction points this paper is trying to address.

lookahead bias
large language models
financial backtesting
time-aware pretraining
temporal data cutoff
Innovation

Methods, ideas, or system contributions that make the work stand out.

lookahead bias
time-aware pretraining
temporal data partitioning
financial backtesting
DatedGPT
Yutong Yan
Department of Finance, CUHK Business School, The Chinese University of Hong Kong

Raphael Tang
Microsoft
machine learning · natural language processing · multimodality · information retrieval

Zhenyu Gao
Department of Finance, CUHK Business School, The Chinese University of Hong Kong

Wenxi Jiang
Department of Finance, CUHK Business School, The Chinese University of Hong Kong

Yao Lu
Assistant Professor @ University College London
Natural Language Processing