🤖 AI Summary
Real-world forecasting often suffers from label delay: ground-truth outcomes become available only after events occur, rendering conventional supervised learning inapplicable. This work proposes a "future-as-label" paradigm that formulates forecasting as a reinforcement learning problem with verifiable rewards: post-hoc ground-truth outcomes serve as supervision signals to train language models under causal masking constraints, enabling probabilistic predictions that are retrospectively evaluated with strictly proper scoring rules such as the Brier score. The approach requires no manual annotation and enables end-to-end learning from delayed rewards. Experiments show that the Qwen3-32B model achieves a 27% improvement in Brier score and halves calibration error on real-world forecasting tasks from Metaculus, outperforming the significantly larger Qwen3-235B model despite having roughly one-seventh of its parameters.
📝 Abstract
Time creates free supervision: forecasts about real-world events resolve to verifiable outcomes. The passage of time provides labels that require no annotation. To exploit this structure, we extend reinforcement learning with verifiable rewards to real-world prediction over time. We train language models to make probabilistic forecasts from causally masked information, using proper scoring rules as the reward function once events resolve. Learning is driven entirely by realized outcomes, enabling scalable outcome-based supervision in open-world prediction. On real-world forecasting benchmarks, Qwen3-32B trained using Foresight Learning improves Brier score by 27% and halves calibration error relative to its pretrained baseline, and outperforms Qwen3-235B on both constructed future-event prediction tasks and the Metaculus benchmark despite a 7x parameter disadvantage.
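The reward construction described above can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation: it assumes binary-outcome questions and uses the Brier score (a strictly proper scoring rule), so that once an event resolves, the realized outcome alone determines the reward. The function names are illustrative.

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a predicted probability and the realized
    outcome (0 or 1). Lower is better: 0.0 is perfect, 1.0 is maximally wrong."""
    return (forecast - outcome) ** 2


def resolution_reward(forecast: float, outcome: int) -> float:
    """Affine rescaling of the Brier score into a reward in [-1, 1], higher
    is better. Because the Brier score is strictly proper, expected reward
    is maximized only by reporting the true probability, so the model is
    incentivized toward calibrated forecasts rather than overconfident ones."""
    return 1.0 - 2.0 * brier_score(forecast, outcome)


# A confident forecast of 0.9 on an event that occurs scores well...
print(round(resolution_reward(0.9, outcome=1), 2))
# ...while the same forecast on an event that does not occur is penalized.
print(round(resolution_reward(0.9, outcome=0), 2))
```

Since such a reward is computable mechanically from public event resolutions, it supplies the annotation-free, outcome-based supervision signal the abstract describes.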