Outcome-based Reinforcement Learning to Predict the Future

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work adapts reinforcement learning methods (GRPO and ReMax) to handle the delayed, noisy reward signals of real-world forecasting tasks, enabling a 14B model to surpass frontier baselines in prediction accuracy and calibration.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p<0.001). A simple trading rule turns this calibration edge into $127 of hypothetical profit versus $92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.
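The abstract's key algorithmic change is to drop GRPO's per-question variance scaling when computing group-relative advantages. With binary forecasting rewards, a nearly unanimous group of rollouts has tiny reward variance, and dividing by it inflates advantages wildly. The sketch below illustrates that adaptation under stated assumptions; it is not the authors' implementation, and the function name and group size are illustrative.

```python
import numpy as np

def grpo_advantages(rewards, scale_by_std=True, eps=1e-6):
    """Group-relative advantages for one question's sampled rollouts.

    Standard GRPO subtracts the group-mean reward and divides by the
    per-question reward std. The paper's adaptation (per the abstract)
    removes that std scaling, which matters for binary, noisy rewards
    where a group can be nearly unanimous. Illustrative sketch only.
    """
    r = np.asarray(rewards, dtype=float)
    adv = r - r.mean()               # baseline-subtracted within the group
    if scale_by_std:
        adv = adv / (r.std() + eps)  # standard GRPO normalisation
    return adv

# Binary resolution rewards for one question, four rollouts
rewards = [1, 0, 1, 1]
print(grpo_advantages(rewards, scale_by_std=False))  # [ 0.25 -0.75  0.25  0.25]
```

Without the std division, advantage magnitudes stay bounded by the reward scale even when a group's rewards are almost identical, which is one plausible reason the adaptation stabilises training on outcome-only signals.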
Problem

Research questions and friction points this paper is trying to address.

Extends RLVR to real-world forecasting with noisy rewards
Adapts GRPO and ReMax for outcome-based forecasting tasks
Turns small LLMs into economically valuable forecasting tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapt GRPO and ReMax for forecasting
Use synthetic questions for training
Implement guard-rails for response quality
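The paper's headline results are reported as Brier score (accuracy) and expected calibration error (ECE). For readers unfamiliar with these forecasting metrics, here is a minimal sketch of how both are commonly computed; the bin count and toy data are illustrative assumptions, not values from the paper.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    Lower is better; 0.25 is the score of always predicting 0.5."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin forecasts by stated probability, then average the gap between
    each bin's empirical outcome rate and its mean forecast, weighted by
    bin size. n_bins=10 is a common illustrative choice."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)

# Toy forecasts (not from the paper)
probs, outcomes = [0.9, 0.8, 0.3, 0.6], [1, 1, 0, 0]
print(brier_score(probs, outcomes))  # 0.125
```

A model can match another on Brier while beating it on ECE, as the 14B model does against o1 in the abstract: its stated probabilities track empirical frequencies more closely, which is exactly the property a betting rule can exploit.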