🤖 AI Summary
This paper addresses the Pass@K optimization objective in reinforcement learning with verifiable rewards (RLVR), where reward signals are sparse and available only upon successful task completion. Method: The authors establish a fundamental equivalence between direct policy gradient methods (e.g., REINFORCE) and advantage-shaping techniques by reinterpreting advantage shaping as the implicit maximization of a surrogate reward. By reverse-engineering existing algorithms—including GRPO and reward-regularized variants—they show that all of them implicitly optimize the same class of surrogate rewards. Building on this insight, they develop a unified framework that systematically derives policy gradient algorithms from surrogate reward specifications. Contribution/Results: This work provides the first theoretical unification of Pass@K policy gradient methods under RLVR, yielding a general analytical paradigm and principled design guidelines for algorithm development in verifiable-reward settings.
📝 Abstract
This note reconciles two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards: (1) direct REINFORCE-style methods, and (2) advantage-shaping techniques that directly modify GRPO. We show that these are two sides of the same coin. By reverse-engineering existing advantage-shaping algorithms, we reveal that they implicitly optimize surrogate rewards. We specifically interpret practical "hard-example up-weighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective provides a lens for RLVR policy gradient optimization beyond our original motivation of Pass@K.
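To make the "two sides of the same coin" claim concrete, here is a minimal numerical sketch, not the paper's implementation: for a single-parameter Bernoulli policy with success probability p = sigmoid(theta), the Pass@K surrogate objective is J = 1 − (1 − p)^K, and a shaped advantage of the form A(r) = K(1 − p)^(K−1)(r − p) makes the REINFORCE estimator's exact expectation match the analytic gradient of J. The specific shaping function and the toy one-parameter policy are our own illustrative assumptions, chosen so the equivalence can be checked in closed form.

```python
# Illustrative check (not the paper's code): a shaped-advantage policy
# gradient recovers the exact gradient of the Pass@K surrogate reward.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def passk_gradient_analytic(theta, K):
    """d/dtheta [1 - (1 - p)^K] with p = sigmoid(theta)."""
    p = sigmoid(theta)
    dp_dtheta = p * (1.0 - p)
    return K * (1.0 - p) ** (K - 1) * dp_dtheta

def passk_gradient_shaped(theta, K):
    """Exact expectation of the shaped-advantage REINFORCE gradient:
    sum over r in {0, 1} of pi(r) * A(r) * d/dtheta log pi(r)."""
    p = sigmoid(theta)
    grad = 0.0
    # (outcome r, probability pi(r), score d/dtheta log pi(r))
    for r, prob, score in [(1, p, 1.0 - p), (0, 1.0 - p, -p)]:
        advantage = K * (1.0 - p) ** (K - 1) * (r - p)  # hypothetical shaping
        grad += prob * advantage * score
    return grad

# The two gradients agree for any theta and K (algebraically they both
# reduce to K * p * (1 - p)^K), which is the equivalence in miniature.
for theta in (-1.5, 0.0, 0.7):
    for K in (1, 4, 8):
        a = passk_gradient_analytic(theta, K)
        b = passk_gradient_shaped(theta, K)
        assert abs(a - b) < 1e-9, (theta, K, a, b)
```

Note that at K = 1 the shaping reduces to the ordinary baseline-subtracted advantage (r − p), so standard REINFORCE is recovered as a special case of the same recipe.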