AI Summary
This work addresses the zero-advantage collapse and hallucination-induced overconfidence that arise in Group Relative Policy Optimization (GRPO) when applied to long-chain reasoning with sparse binary rewards. To mitigate these issues, the authors propose Intrinsic Signal Policy Optimization (ISPO), which derives dense, endogenous rewards entirely from the policy's own conditional probabilities. ISPO integrates a sequence-level measure of information content in reasoning trajectories with token-level directional rewards, augmented by a hinge penalty that explicitly penalizes overconfident hallucinations. This design alleviates gradient vanishing and erroneous self-assurance during training. Experiments across three base models and five mathematical reasoning benchmarks demonstrate that ISPO consistently outperforms strong baselines, with particularly pronounced gains on the most challenging tasks. Training dynamics further confirm its capacity to suppress the two identified structural failure modes.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
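To make Zero-Advantage Collapse concrete, the following minimal sketch computes group-relative advantages in the usual GRPO style, by standardizing each rollout's reward within its group. It is an illustration under stated assumptions and not the authors' implementation: the function name `group_relative_advantages`, the stabilizer `eps`, and the use of the population standard deviation are all choices made here for the example. It does not implement ISPO's intrinsic signals, whose exact form the abstract does not give.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style sketch).

    Assumptions for illustration: population standard deviation and a
    small eps in the denominator to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed binary outcomes yields nonzero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every rollout shares the same outcome carry no learning signal:
# each reward equals the group mean, so every advantage is exactly 0.0.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:    ", mixed)
print("all wrong:", all_wrong)
print("all right:", all_right)
```

Because the policy-gradient term is weighted by these advantages, a group whose rollouts are all correct or all incorrect contributes nothing to the update. On hard prompts, where every sampled rollout tends to fail, this happens often, which is consistent with the abstract's observation that gains are largest on the hardest benchmarks.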