Forecasting Future Behavior as a Learning Task

📅 2026-06-09

🤖 AI Summary

This work addresses the difficulty of predicting the behavior of large reasoning models (LRMs) with conventional explanation methods: such methods do not generalize well to long reasoning trajectories, and the trajectories themselves are often unfaithful when read as natural language. The study formalizes LRM behavior prediction as an end-to-end learnable task and introduces a predictor that requires neither human annotations nor explicit explanations, instead forecasting future model behavior directly from a single observed reasoning trajectory. The predictor is initialized from the target LRM, fine-tuned on data obtained by querying that LRM, and makes each forecast in a single forward pass. On two tasks (how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer), evaluated across three diverse reasoning datasets, the trained predictor is more accurate than GPT-5.4 and Claude Opus-4.6 used as naive readers of the same trajectories, at a small fraction of their inference cost.

📝 Abstract

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single-token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.
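
As a rough illustration of how annotation-free labels for the first task might be built, the sketch below computes a re-run agreement rate: the fraction of fresh samples whose final answer matches the answer from the observed trajectory. The names `rerun_agreement`, `sample_answer`, and `make_stub_sampler` are hypothetical and not taken from the paper, the sampler is a stub rather than a real LRM call, and the paper's actual labeling procedure may differ.

```python
import random
from typing import Callable


def rerun_agreement(
    query: str,
    observed_answer: str,
    sample_answer: Callable[[str], str],
    n_reruns: int = 8,
) -> float:
    """Fraction of re-runs whose final answer equals the observed answer.

    `sample_answer` is an assumed interface: it takes a query and returns
    the final answer of one fresh LRM run on that query.
    """
    if n_reruns <= 0:
        raise ValueError("n_reruns must be positive")
    matches = sum(sample_answer(query) == observed_answer for _ in range(n_reruns))
    return matches / n_reruns


def make_stub_sampler(answers: list[str], seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for querying the LRM: draws from a fixed answer pool."""
    rng = random.Random(seed)
    return lambda _query: rng.choice(answers)


if __name__ == "__main__":
    sampler = make_stub_sampler(["42", "42", "42", "41"])
    label = rerun_agreement("What is 6 * 7?", "42", sampler, n_reruns=100)
    print(f"re-run agreement label: {label:.2f}")
```

Under this sketch, a forecaster would be trained to predict the resulting label from a single trajectory, so the repeated sampling is paid only when building training data and not at inference time.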

Problem

Research questions and friction points this paper is trying to address.

behavior forecasting

large reasoning models

reasoning trajectories

AI trust

model interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

behavior forecasting

large reasoning models

reasoning trajectory