🤖 AI Summary
This work addresses premature convergence in reinforcement learning with verifiable rewards (RLVR) for large language model reasoning, which often arises when policy entropy collapses rapidly and the policy settles on overly narrow reasoning paths. To mitigate this, the authors propose Position-Aware Entropy Calibration (PAEC), a token-level mechanism that identifies decision-sensitive positions during generation by building a soft mask from local top-p entropy and the competition between the top two candidate tokens. At the selected positions, an anchor-based lower-bound penalty keeps entropy from collapsing and encourages selective exploration. This shifts entropy regularization from uniform application across all tokens to targeted modulation at key decision points, which the authors argue is more efficient for long reasoning trajectories where many tokens are not decision-relevant. On five mathematical reasoning benchmarks, the method improves macro-average majority-vote performance over strong RLVR baselines, with the clearest gains on AIME-style problems.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) improves large language model reasoning but often suffers from rapid policy-entropy collapse, where the policy prematurely concentrates on narrow high-probability reasoning paths. While global entropy regularization can encourage exploration, uniformly increasing entropy across all token positions is inefficient for long reasoning trajectories, where many tokens are not decision-relevant. We propose Position-Aware Entropy Calibration (PAEC), a token-level entropy-management framework that constructs a soft mask from local top-p entropy and top-two candidate competition, and applies an anchor-based lower-bound penalty to prevent entropy collapse at the selected positions. Experiments on five mathematical reasoning benchmarks show that PAEC improves macro-average majority-vote performance over strong RLVR baselines, with clear gains on AIME-style tasks. Our results suggest that entropy management in reasoning RL should be formulated as selective exploration allocation over decision-sensitive positions rather than uniform randomness injection.
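
The abstract names three ingredients (local top-p entropy, top-two candidate competition, and an anchor-based lower-bound penalty) without giving formulas. The sketch below is one possible reading, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the averaging of normalized entropy with the ratio p2/p1, the sigmoid gate with its `threshold` and `temperature`, and the hinge form of the penalty against a per-position `anchor` entropy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_p_entropy(probs, top_p=0.9):
    """Entropy (nats) of the renormalized nucleus: the smallest set of
    highest-probability tokens whose total mass reaches top_p."""
    nucleus, mass = [], 0.0
    for p in sorted(probs, reverse=True):
        nucleus.append(p)
        mass += p
        if mass >= top_p:
            break
    return -sum((p / mass) * math.log(p / mass) for p in nucleus if p > 0.0)


def top_two_competition(probs):
    """Ratio p2 / p1 in [0, 1]; values near 1 mean the two best
    candidates are nearly tied."""
    ranked = sorted(probs, reverse=True)
    return ranked[1] / ranked[0]


def soft_mask(probs, top_p=0.9, threshold=0.5, temperature=0.1):
    """Hypothetical soft mask in (0, 1) for one token position.

    Assumption: average the nucleus entropy (normalized by log of the
    vocabulary size) with the top-two ratio, then pass the score
    through a sigmoid gate.
    """
    h_norm = top_p_entropy(probs, top_p) / math.log(len(probs))
    score = 0.5 * (h_norm + top_two_competition(probs))
    return 1.0 / (1.0 + math.exp(-(score - threshold) / temperature))


def anchor_lower_bound_penalty(masks, entropies, anchors, eps=1e-8):
    """Hypothetical mask-weighted hinge penalty.

    A position contributes only when its entropy falls below its
    anchor, so entropy is never pushed above the anchor by this term.
    """
    weighted = sum(
        m * max(0.0, a - h) for m, h, a in zip(masks, entropies, anchors)
    )
    return weighted / max(sum(masks), eps)
```

Under these assumptions, a sharply peaked distribution such as `softmax([10, 0, 0, 0])` receives a mask close to 0, while a near-tie such as `softmax([2, 2, 0, 0])` receives a mask close to 1, so the penalty acts mainly on positions where the policy is choosing between competing continuations.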