🤖 AI Summary
This work addresses a limitation of existing reinforcement learning approaches for large language models (LLMs): they rely on sparse binary reward signals and neglect the model’s internal uncertainty, so they struggle to guide reasoning effectively. Within the Group Relative Policy Optimization (GRPO) framework, the study introduces, for the first time, a token-level confidence-driven reward shaping mechanism. This mechanism builds confidence-aware rewards from the model’s log-probabilities, penalizing overconfident errors while reinforcing reasoning trajectories that are both correct and confident. By combining intrinsic uncertainty with external rewards, the proposed method consistently outperforms the GRPO baseline across LLMs of varying scales, with average reasoning performance gains of 2.3%–4.0%.
📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and by its disregard for model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%–4.0% across different model scales.
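The abstract does not give the exact aggregation or shaping formulas, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the scalar confidence is the geometric mean of token probabilities (the exponential of the mean log-probability), that shaping adds a confidence bonus to correct responses and a confidence penalty to incorrect ones, and that advantages are the usual group-normalized rewards. The function names and the weight `alpha` are hypothetical.

```python
import math
from statistics import mean, pstdev


def sequence_confidence(token_logprobs):
    """Aggregate per-token log-probs into one score in (0, 1].

    Assumption: geometric mean of token probabilities, i.e. exp(mean log-prob).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def shaped_reward(is_correct, confidence, alpha=0.5):
    """Hypothetical confidence-aware reward.

    Correct answers get the binary reward 1 plus a confidence bonus;
    incorrect answers get 0 minus a confidence penalty, so an
    overconfident error scores lowest.
    """
    if is_correct:
        return 1.0 + alpha * confidence
    return -alpha * confidence


def group_relative_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice of estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses: (verifier result, token log-probs).
group = [
    (True, [-0.1, -0.2]),   # correct and confident
    (True, [-1.0, -1.2]),   # correct but unsure
    (False, [-0.1, -0.1]),  # wrong and overconfident
    (False, [-1.5, -1.0]),  # wrong and unsure
]

rewards = [shaped_reward(ok, sequence_confidence(lp)) for ok, lp in group]
advantages = group_relative_advantages(rewards)
```

Under these assumptions the confident correct response receives the largest advantage and the overconfident error the smallest, whereas plain binary rewards would tie the two correct responses and tie the two incorrect ones.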