🤖 AI Summary
This study investigates how the reliability of large language models (LLMs) as forecasters evolves across the market lifecycle. Using 150 CFTC-regulated Kalshi binary markets, the authors conduct a controlled evaluation of ten state-of-the-art models, with and without web search, at five distinct time points, employing the Brier Skill Score, multi-model ensembles, and a cross-temporal experimental design. The work provides the first systematic evidence that LLMs exhibit stronger predictive performance during early market stages and periods of high uncertainty. It further shows that the efficacy of tool use is context-dependent: while web search generally enhances accuracy, it degrades performance in 12% of model-checkpoint pairs. Simple ensembling reduces individual model error but fails to surpass market consensus forecasts. These findings motivate time-aware evaluation protocols and selective-deference strategies in forecasting applications.
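For concreteness, here is a minimal sketch of the skill metric the study relies on. The Brier Skill Score compares a forecaster's Brier score against a reference forecast; taking the reference to be the Kalshi market price read as an implied probability is an assumption on my part (the exact baseline construction is not spelled out in this summary), and all numbers below are illustrative.

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and binary outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

def brier_skill_score(model_probs: np.ndarray,
                      market_probs: np.ndarray,
                      outcomes: np.ndarray) -> float:
    """BSS = 1 - BS_model / BS_reference.

    Positive values mean the model beats the reference forecast;
    negative values mean it underperforms the reference.
    """
    return 1.0 - brier_score(model_probs, outcomes) / brier_score(market_probs, outcomes)

# Illustrative values only (not from the paper):
model_probs  = np.array([0.7, 0.2, 0.9])  # model's P(yes) per market
market_probs = np.array([0.6, 0.3, 0.8])  # market price as implied probability (assumed baseline)
outcomes     = np.array([1, 0, 1])        # resolved outcomes
print(brier_skill_score(model_probs, market_probs, outcomes))  # ~0.517 on this toy data
```

Under this convention, a BSS of 0 means the model matches the market baseline, which is why "fails to surpass market consensus" corresponds to a pooled BSS at or below zero.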
📝 Abstract
We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, yielding 15,000 forecasts in total. Models are most competitive early in a market's life and on high-uncertainty markets, but far less competitive near resolution and on strong-consensus markets. Web search improves pooled Brier Skill Score (BSS) for every model overall, yet hurts in 12% of model-checkpoint pairs, indicating that retrieval helps on average but not uniformly. Simple two-model ensembles reduce error without surpassing the market overall. These descriptive results motivate time-aware evaluation and selective-deference policies rather than evaluation at a single market snapshot or under a uniform tool-use setting.
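The abstract does not specify the ensembling rule; a plausible reading of "simple two-model ensembles" is an equal-weight average of the two models' probabilities, which is the sketch below. The forecasts and outcomes are made up for illustration, not results from the paper.

```python
import numpy as np

def ensemble_probs(probs_a: np.ndarray, probs_b: np.ndarray) -> np.ndarray:
    """Equal-weight average of two models' forecast probabilities (assumed rule)."""
    return 0.5 * (probs_a + probs_b)

# Illustrative forecasts for three markets (not from the paper):
probs_a  = np.array([0.8, 0.3, 0.6])
probs_b  = np.array([0.6, 0.1, 0.9])
outcomes = np.array([1, 0, 1])

for name, p in [("A", probs_a), ("B", probs_b), ("A+B", ensemble_probs(probs_a, probs_b))]:
    bs = float(np.mean((p - outcomes) ** 2))  # Brier score: lower is better
    print(f"{name}: Brier score = {bs:.4f}")
```

On this toy data the averaged forecast scores better than either model alone, mirroring the paper's finding that simple ensembling reduces error, though it does not by itself guarantee beating the market baseline.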