TimeSeek: Temporal Reliability of Agentic Forecasters

📅 2026-04-05
🤖 AI Summary
This study investigates how the reliability of large language models (LLMs) as forecasters changes over a prediction market's lifecycle. Using 150 CFTC-regulated Kalshi binary markets, the authors evaluate ten frontier models, with and without web search, at five temporal checkpoints (15,000 forecasts in total), scoring them by Brier Skill Score (BSS). Models are most competitive early in a market's life and on high-uncertainty markets, and much less competitive near resolution and on strong-consensus markets. The benefit of tool use is context-dependent: web search improves pooled BSS for every model overall, yet hurts in 12% of model-checkpoint pairs. Simple two-model ensembles reduce error but do not surpass the market overall. The authors argue that these results motivate time-aware evaluation protocols and selective-deference policies in forecasting applications.
📝 Abstract
We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, for 15,000 forecasts total. Models are most competitive early in a market's life and on high-uncertainty markets, but much less competitive near resolution and on strong-consensus markets. Web search improves pooled Brier Skill Score (BSS) for every model overall, yet hurts in 12% of model-checkpoint pairs, indicating that retrieval is helpful on average but not uniformly so. Simple two-model ensembles reduce error without surpassing the market overall. These descriptive results motivate time-aware evaluation and selective-deference policies rather than a single market snapshot or a uniform tool-use setting.
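For readers unfamiliar with the metric, the following is a minimal sketch of how a Brier Skill Score against a market baseline could be computed. It assumes the standard definition, BSS = 1 - BS_model / BS_reference, with the market price taken as the reference forecast; the abstract does not spell out the paper's exact formulation. The function names, the unweighted two-model average in `mean_ensemble`, and the toy numbers are all illustrative assumptions, not the authors' code or data.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    model_probs: Sequence[float],
    market_probs: Sequence[float],
    outcomes: Sequence[int],
) -> float:
    """BSS relative to the market: > 0 beats the market, < 0 is worse.

    Assumption: the market price is used as the reference forecast.
    """
    bs_model = brier_score(model_probs, outcomes)
    bs_market = brier_score(market_probs, outcomes)
    if bs_market == 0:
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1.0 - bs_model / bs_market


def mean_ensemble(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Hypothetical two-model ensemble: unweighted average of probabilities."""
    if len(a) != len(b):
        raise ValueError("forecast lists must have equal length")
    return [(x + y) / 2 for x, y in zip(a, b)]


# Toy, made-up inputs for four binary markets (not data from the paper).
outcomes = [1, 0, 1, 0]
market = [0.5, 0.5, 0.5, 0.5]
model_a = [0.75, 0.25, 0.75, 0.25]
model_b = [0.25, 0.25, 0.75, 0.75]

bss_a = brier_skill_score(model_a, market, outcomes)
bss_b = brier_skill_score(model_b, market, outcomes)
bss_ens = brier_skill_score(mean_ensemble(model_a, model_b), market, outcomes)
```

In this toy setup the ensemble lands between the two individual models. Whether an ensemble beats the market depends entirely on the underlying forecasts, which is the empirical question the benchmark addresses.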
Problem

Research questions and friction points this paper is trying to address.

temporal reliability
agentic forecasters
prediction markets
LLM evaluation
time-aware forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Reliability
Agentic Forecasters
Prediction Markets
Web Search Integration
Time-aware Evaluation
Dennis Lee
Google
Hamza Mostafa
Cheriton School of Computer Science, University of Waterloo
Om Shastri
The Wharton School, University of Pennsylvania