Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

career value

195K/year

🤖 AI Summary

Existing approaches struggle to adaptively regulate the cognitive behavior of large language models during inference due to their reliance on explicit, static control strategies that cannot flexibly respond to dynamic reasoning states and error-correction demands. This work proposes a runtime adaptive framework that implicitly steers beneficial cognitive behaviors by optimizing latent states in sparse autoencoders (SAEs). It employs an implicit reward model—trained solely on final answer correctness—to evaluate the quality of intermediate reasoning states, and integrates this reward with a confidence-gated mechanism to apply precise gradient-based interventions when the model is in fragile states. Notably, the method requires no predefined behavioral templates or fixed guidance directions, achieving, for the first time, state-adaptive control driven by latent-space reward signals. Extensive experiments across multiple large models and reasoning benchmarks demonstrate significant improvements over existing baselines, with post-hoc analyses confirming its effectiveness in promoting error-correcting cognitive behaviors.

📝 Abstract

Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that \ours consistently improves performance over various baselines, and post-hoc analyses further indicate that \ours implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.

Problem

Research questions and friction points this paper is trying to address.

cognitive behaviors

reasoning LLMs

adaptive inference

latent states

reward steering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Reward Steering

Sparse Autoencoder

Inference-Time Adaptation