Predicting Future Behaviors in Reasoning Models Enables Better Steering

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses unexpected behavior in deployed reasoning models. Existing test-time steering methods rely on features that detect behavior in already-generated text, so they anticipate future behavior poorly and often degrade output quality. To overcome this, the authors propose Future Probe Controlled Generation (FPCG), which trains activation probes to predict future behavior likelihoods from intermediate reasoning steps, samples multiple candidate sentences, and keeps the one the probe rates as most aligned with the desired behavior. The work explicitly distinguishes "behavior detection" features from "behavior prediction" features within the model and treats the predictive features as the natural intervention target. In the reported experiments, the probes predict the most likely future behavior with 64%–91% accuracy, and FPCG steers model behavior with almost no loss in output quality, including in several evaluations where activation steering fails.
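To make the idea of an activation probe concrete, here is a minimal sketch under stated assumptions. The summary does not specify the probe architecture, so a logistic-regression probe is assumed; the "activations" are synthetic Gaussian vectors rather than real hidden states, and every name (`train_probe`, `predict`, `make_toy_data`) is hypothetical. It shows only the general shape of the technique: fit a classifier that maps an intermediate activation vector to the probability that a behavior appears later.

```python
import math
import random


def predict(w, b, x):
    """Probe output: estimated probability that the behavior occurs later."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays numerically safe
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(xs, ys, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent.

    Assumption: a linear probe; the source does not state the probe type.
    """
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y  # gradient of the log loss w.r.t. z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_toy_data(n, dim, rng):
    """Synthetic stand-in for (activation, future-behavior label) pairs."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        mean = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(mean, 1.0) for _ in range(dim)])
        ys.append(y)
    return xs, ys


def accuracy(w, b, xs, ys):
    hits = sum((predict(w, b, x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)
train_x, train_y = make_toy_data(200, 8, rng)
test_x, test_y = make_toy_data(100, 8, rng)
w, b = train_probe(train_x, train_y)
print(f"held-out accuracy on toy data: {accuracy(w, b, test_x, test_y):.2f}")
```

The toy accuracy says nothing about the paper's results; in a real setup the inputs would be hidden states captured at intermediate reasoning steps and the labels would record which behavior the model went on to exhibit.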

📝 Abstract

Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation (FPCG). FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.
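The sample-then-select step described in the abstract can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: `fpcg_style_generate`, `make_cycling_sampler`, and `toy_probe` are hypothetical names, the sampler cycles through fixed sentences instead of sampling from an LRM, and the probe scores text with a keyword count instead of reading model activations.

```python
import itertools

SENTENCES = [
    "Let me verify this.",
    "The answer is 4.",
    "So we are done.",
    "Wait, check again.",
]


def make_cycling_sampler():
    """Stand-in for sampling a next sentence from the model (assumption).

    A real system would sample continuations from the LRM given `text`.
    """
    stream = itertools.cycle(SENTENCES)
    return lambda text: next(stream)


def toy_probe(text):
    """Stand-in for a future-behavior probe (assumption).

    A real probe would score hidden activations; this one counts keywords.
    """
    return text.count("verify") + text.count("check")


def fpcg_style_generate(prompt, sample_sentence, probe,
                        num_steps=3, num_candidates=4, maximize=True):
    """At each step, sample candidate sentences and keep the one the probe
    scores highest (or lowest, to suppress the behavior)."""
    pick = max if maximize else min
    text = prompt
    for _ in range(num_steps):
        candidates = [sample_sentence(text) for _ in range(num_candidates)]
        best = pick(candidates, key=lambda s: probe(text + " " + s))
        text = text + " " + best
    return text


prompt = "What is 2 + 2?"
steered_up = fpcg_style_generate(prompt, make_cycling_sampler(), toy_probe)
steered_down = fpcg_style_generate(prompt, make_cycling_sampler(), toy_probe,
                                   maximize=False)
print(steered_up)
print(steered_down)
```

Because selection happens over sentences the model itself produced, every chosen sentence is an ordinary sample from the model, which is one plausible reading of why the abstract reports almost no quality degradation compared with editing hidden representations directly.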

Problem

Research questions and friction points this paper is trying to address.

large reasoning models

test-time steering

behavior prediction

activation probes

output quality degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

future behavior prediction

activation probes

steering