🤖 AI Summary
Standard RLVR training allocates a fixed rollout budget uniformly across all queries, ignoring differences in their difficulty and wasting computation. This work proposes sorted Group Policy Optimization (sGPO), which, for the first time, uses low-cost inference-time sampling to estimate query difficulty and sets each query's rollout group size to the inverse of its empirical success rate. This unified approach simultaneously enables adaptive sampling, data filtering, and curriculum learning. Requiring only rollouts from the initial policy to estimate empirical success rates, sGPO cuts total training compute to one-third of the baseline, inference overhead included, while maintaining or improving performance.
📝 Abstract
Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, regardless of how difficult that query is for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes consume training FLOPs without contributing to the learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This estimate motivates setting the training rollout group size to the inverse of the success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. The single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
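To make the profiling pass concrete, the following is a minimal sketch of how one set of empirical success rates could drive filtering, group sizing, and curriculum ordering together. It is an illustration under stated assumptions, not the authors' implementation: the function name `plan_rollouts`, the cap `g_max`, the floor of two rollouts per group, the "keep every k-th unsolved query" sub-sampling rule, and the ceiling rounding of the inverse success rate are all hypothetical choices that the abstract does not specify.

```python
def plan_rollouts(success_counts, n_profile, g_max=64, keep_unsolved_every=4):
    """Turn profiling results into a training plan.

    success_counts: dict mapping query id -> number of correct samples
                    out of n_profile samples drawn from the initial policy.
    Returns a list of (query_id, success_rate, group_size), easy to hard.
    All thresholds here are illustrative assumptions.
    """
    plan = []
    unsolved_seen = 0
    for query_id, successes in success_counts.items():
        p_hat = successes / n_profile  # empirical success rate

        if successes == n_profile:
            # Trivial query: every rollout succeeds, so advantages vanish.
            continue

        if successes == 0:
            # Unsolved during profiling: keep only a sub-sample (assumed rule)
            # and give it the largest allowed group.
            unsolved_seen += 1
            if unsolved_seen % keep_unsolved_every != 0:
                continue
            group_size = g_max
        else:
            # Group size ~ 1 / p_hat, computed as an integer ceiling of
            # n_profile / successes, then clamped to [2, g_max].
            inverse_rate = -(-n_profile // successes)
            group_size = max(2, min(g_max, inverse_rate))

        plan.append((query_id, p_hat, group_size))

    # Curriculum: highest success rate (easiest) first.
    plan.sort(key=lambda item: item[1], reverse=True)
    return plan


if __name__ == "__main__":
    counts = {"a": 8, "b": 4, "c": 1, "d": 0, "e": 0, "f": 3}
    for row in plan_rollouts(counts, n_profile=8, g_max=16, keep_unsolved_every=2):
        print(row)
```

With a group size near the inverse success rate, a group contains about one successful rollout in expectation, which is the regime where neither all-success nor all-failure groups dominate. The cap `g_max` is needed in any practical variant because the inverse is undefined when no profiling sample succeeds.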