🤖 AI Summary
This study investigates how the reliability of large language models (LLMs) as forecasters evolves across the market lifecycle. Using 150 CFTC-regulated Kalshi binary markets, the authors conduct a controlled evaluation of ten state-of-the-art models, with and without web search, at five distinct time points, employing the Brier Skill Score, multi-model ensembles, and a cross-temporal experimental design. The work provides the first systematic evidence that LLMs exhibit stronger predictive performance during early market stages and periods of high uncertainty. It further shows that the efficacy of tool use is context-dependent: while web search generally enhances accuracy, it degrades performance in 12% of model-checkpoint pairs. Simple ensembling reduces individual model error but fails to surpass market consensus forecasts. These findings motivate time-aware evaluation protocols and selective-deference strategies in forecasting applications.
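For concreteness, here is a minimal sketch of the skill metric the study relies on. The Brier Skill Score compares a forecaster's Brier score against a reference forecast; taking the reference to be the Kalshi market price read as an implied probability is an assumption on my part (the exact baseline construction is not spelled out in this summary), and all numbers below are illustrative.

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and binary outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

def brier_skill_score(model_probs: np.ndarray,
                      market_probs: np.ndarray,
                      outcomes: np.ndarray) -> float:
    """BSS = 1 - BS_model / BS_reference.

    Positive values mean the model beats the reference forecast;
    negative values mean it underperforms the reference.
    """
    return 1.0 - brier_score(model_probs, outcomes) / brier_score(market_probs, outcomes)

# Illustrative values only (not from the paper):
model_probs  = np.array([0.7, 0.2, 0.9])  # model's P(yes) per market
market_probs = np.array([0.6, 0.3, 0.8])  # market price as implied probability (assumed baseline)
outcomes     = np.array([1, 0, 1])        # resolved outcomes
print(brier_skill_score(model_probs, market_probs, outcomes))  # ~0.517 on this toy data
```

Under this convention, a BSS of 0 means the model matches the market baseline, which is why "fails to surpass market consensus" corresponds to a pooled BSS at or below zero.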
📝 Abstract
We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, yielding 15,000 forecasts in total. Models are most competitive early in a market's life and on high-uncertainty markets, but far less competitive near resolution and on strong-consensus markets. Web search improves pooled Brier Skill Score (BSS) for every model overall, yet hurts in 12% of model-checkpoint pairs, indicating that retrieval helps on average but not uniformly. Simple two-model ensembles reduce error without surpassing the market overall. These descriptive results motivate time-aware evaluation and selective-deference policies rather than evaluation at a single market snapshot or under a uniform tool-use setting.
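The abstract does not specify the ensembling rule; a plausible reading of "simple two-model ensembles" is an equal-weight average of the two models' probabilities, which is the sketch below. The forecasts and outcomes are made up for illustration, not results from the paper.

```python
import numpy as np

def ensemble_probs(probs_a: np.ndarray, probs_b: np.ndarray) -> np.ndarray:
    """Equal-weight average of two models' forecast probabilities (assumed rule)."""
    return 0.5 * (probs_a + probs_b)

# Illustrative forecasts for three markets (not from the paper):
probs_a  = np.array([0.8, 0.3, 0.6])
probs_b  = np.array([0.6, 0.1, 0.9])
outcomes = np.array([1, 0, 1])

for name, p in [("A", probs_a), ("B", probs_b), ("A+B", ensemble_probs(probs_a, probs_b))]:
    bs = float(np.mean((p - outcomes) ** 2))  # Brier score: lower is better
    print(f"{name}: Brier score = {bs:.4f}")
```

On this toy data the averaged forecast scores better than either model alone, mirroring the paper's finding that simple ensembling reduces error, though it does not by itself guarantee beating the market baseline.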