RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

📅 2026-05-31

🤖 AI Summary
This work addresses a core inefficiency in reinforcement learning for LLM reasoning: a large proportion of sampled response groups are ineffective because they are uniformly correct or uniformly incorrect, so their rewards have zero variance and they provide no informative learning signal. To overcome this limitation, the authors propose the Group Prioritized Off-Policy Optimization (POPO) framework, which combines reward-variance-based assessment of sample validity, a prioritized group replay mechanism, and decoupled importance sampling to make efficient use of effective training batches while mitigating off-policy bias, all without incurring additional LLM inference overhead. Integrated with trust-region-constrained policy optimization, POPO significantly accelerates reinforcement fine-tuning and achieves superior reasoning performance across mathematical reasoning, planning, and visual geometry tasks while using fewer rollouts.
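To make the zero-variance problem concrete, here is a minimal sketch of why a uniformly correct or uniformly incorrect group carries no learning signal. It assumes a group-normalized advantage of the kind used in GRPO-style methods, which the summary does not specify; the function names `group_advantages` and `is_effective` and the `eps` constant are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: a GRPO-style normalization; the paper's exact
    advantage estimator may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_effective(rewards):
    """Treat a group as effective if its rewards are not all identical."""
    return pstdev(rewards) > 0.0


all_correct = [1.0, 1.0, 1.0, 1.0]  # every response verified correct
mixed = [1.0, 0.0, 0.0, 1.0]        # some correct, some incorrect

# Every advantage in the uniform group is zero, so it contributes no gradient.
print(group_advantages(all_correct), is_effective(all_correct))
print(group_advantages(mixed), is_effective(mixed))
```

Under this assumption, a uniformly incorrect group behaves the same way as the uniformly correct one: every reward equals the group mean, so all advantages vanish.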
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective training data: many sampled prompts yield response groups that are either entirely correct or entirely incorrect, resulting in zero-variance rewards and limited learning signals. Recent state-of-the-art methods address this issue through extensive LLM rollouts to filter ineffective samples, but at the cost of considerable computational overhead. Alternative approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but often remain insufficient and may introduce additional issues such as systematic bias or suboptimal constraints. To address these limitations, we propose Group Prioritized Off-Policy Optimization (POPO), a simple yet effective framework that fully exploits effective training batches without additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. The former replaces ineffective on-policy groups with effective off-policy groups via a recency-based replay mechanism that jointly considers sample quality and the degree of off-policiness. To further mitigate the off-policy gap, POPO employs decoupled importance sampling to correct off-policy bias while maintaining stable policy updates under consistent trust-region constraints. Empirical evaluations across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.
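The prioritized group replay idea described in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the abstract does not give the priority formula, so the `priority` score (reward variance minus a linear staleness penalty), the `staleness_penalty` value, and the `Group` and `fill_batch` names are all assumptions chosen only to show how sample quality and off-policiness might be weighed together.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class Group:
    prompt: str
    rewards: list  # one verifiable reward per sampled response
    step: int      # training step at which the group was sampled


def priority(group, current_step, staleness_penalty=0.1):
    """Hypothetical score: reward variance stands in for sample quality,
    and age in training steps stands in for the degree of off-policiness."""
    age = current_step - group.step
    return pvariance(group.rewards) - staleness_penalty * age


def fill_batch(on_policy, buffer, current_step):
    """Keep effective on-policy groups and replace ineffective ones
    with the highest-priority groups from the replay buffer."""
    effective = [g for g in on_policy if pvariance(g.rewards) > 0]
    n_missing = len(on_policy) - len(effective)
    ranked = sorted(buffer, key=lambda g: priority(g, current_step), reverse=True)
    replayed = ranked[:n_missing]
    # Effective on-policy groups become replay candidates for later steps.
    buffer.extend(effective)
    return effective + replayed


buffer = [
    Group("a", [1, 0, 1, 0], step=9),  # recent, high variance
    Group("b", [1, 0, 0, 0], step=2),  # older, lower variance
]
on_policy = [
    Group("c", [1, 1, 1, 1], step=10),  # ineffective: zero variance
    Group("d", [0, 1, 0, 1], step=10),  # effective
]
batch = fill_batch(on_policy, buffer, current_step=10)
print([g.prompt for g in batch])
```

In this sketch no extra rollouts are needed to fill the batch, since the replacement groups were already generated at earlier steps. The replayed groups would still need an off-policy correction, which the abstract attributes to decoupled importance sampling and which is not modeled here.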
Problem

Research questions and friction points this paper is trying to address.

RLVR
ineffective samples
reward variance
LLM reasoning
off-policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Prioritized Replay
Off-Policy Optimization
Importance Sampling
RLVR
Data Efficiency