ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

📅 2026-06-06

🤖 AI Summary

This work addresses a limitation of existing reinforcement learning approaches for large language models (LLMs): they rely on sparse binary reward signals and neglect the model’s internal uncertainty, which makes it hard to guide reasoning effectively. Within the Group Relative Policy Optimization (GRPO) framework, the study introduces a token-level, confidence-driven reward shaping mechanism, which it presents as the first of its kind. The mechanism constructs confidence-aware rewards from the model’s log-probabilities, penalizing overconfident errors while reinforcing correct and confident reasoning trajectories. By combining intrinsic uncertainty with external rewards, the proposed method consistently outperforms the GRPO baseline across LLMs of varying scales, with average reasoning performance gains of 2.3%–4.0%.
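The reward construction described above can be illustrated with a minimal sketch. The summary does not give the exact formula, so everything specific here is an assumption made for illustration: the function names `sequence_confidence` and `shaped_reward`, the geometric-mean aggregation of token probabilities, the additive shaping rule, and the weight `alpha` are all hypothetical and may differ from the paper's definitions.

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Aggregate per-token log-probabilities into one scalar in (0, 1].

    Assumption: the geometric mean of token probabilities, i.e.
    exp(mean(log p_t)). The paper may use a different aggregation.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def shaped_reward(is_correct: bool, confidence: float, alpha: float = 0.5) -> float:
    """Combine the binary verifier reward with the confidence score.

    Assumption: a simple additive rule with a hypothetical weight alpha.
    Correct answers earn a bonus that grows with confidence; incorrect
    answers receive a penalty that grows with confidence, so an
    overconfident error scores lowest.
    """
    base = 1.0 if is_correct else 0.0
    if is_correct:
        return base + alpha * confidence
    return base - alpha * confidence


# Two responses with different per-token log-probabilities.
confident = sequence_confidence([-0.05, -0.10, -0.02])
hesitant = sequence_confidence([-1.20, -0.90, -1.50])

# Among correct answers the confident one is rewarded more;
# among wrong answers the confident one is penalized more.
print(shaped_reward(True, confident), shaped_reward(True, hesitant))
print(shaped_reward(False, confident), shaped_reward(False, hesitant))
```

Under this rule the four cases are ordered as confident-correct, hesitant-correct, hesitant-wrong, confident-wrong, which turns the two-valued verifier signal into a graded one.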

📝 Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its disregard for model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%–4.0% across different model scales.
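To see why reward sparsity matters inside GRPO, recall that GRPO scores each sampled response relative to the other responses in its group, typically by standardizing the rewards within the group. The sketch below is an illustration rather than the paper's implementation: the helper name `group_relative_advantages`, the use of the population standard deviation, the `eps` constant, and the example reward values are assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps for
    numerical stability; implementations differ on these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Binary verifier rewards: both correct responses receive the same
# advantage, and so do both incorrect ones.
binary = group_relative_advantages([1.0, 1.0, 0.0, 0.0])

# Hypothetical confidence-shaped rewards for the same four responses:
# the advantages now also rank responses within each correctness class.
shaped = group_relative_advantages([1.45, 1.10, -0.10, -0.45])

# If every response in a group gets the same reward, all advantages
# are zero and the group contributes no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(binary)
print(shaped)
print(uniform)
```

With binary rewards a group carries at most two distinct advantage values, whereas a graded reward yields a distinct advantage for each response, which is the extra signal that confidence shaping is meant to provide.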

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning from Verifiable Rewards

reasoning capabilities

Large Language Models

model uncertainty

sparse rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Confidence-aware reinforcement learning

Reasoning enhancement

Token-level confidence