🤖 AI Summary
To address the computational asymmetry in reinforcement learning for large language models, where inference is highly parallelizable but policy updates are memory-intensive and synchronization-heavy, this paper proposes the PODS framework. PODS generates rollouts in parallel at scale, then updates the policy on only the most informative subset. Its core contribution is max-variance down-sampling, the first provably efficient method of its kind: it selects rollouts with maximally diverse reward signals and cleanly decouples rollout sub-sampling from policy optimization. On the GSM8K benchmark, PODS significantly outperforms standard GRPO, improving accuracy while cutting GPU memory consumption by 37% and delivering a 2.1× training speedup.
📝 Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing reasoning capabilities in large language models, but faces a fundamental asymmetry in computation and memory requirements: inference is embarrassingly parallel with a minimal memory footprint, while policy updates require extensive synchronization and are memory-intensive. To address this asymmetry, we introduce PODS (Policy Optimization with Down-Sampling), a framework that strategically decouples these phases by generating numerous rollouts in parallel but updating only on an informative subset. Within this framework, we develop max-variance down-sampling, a theoretically motivated method that selects rollouts with maximally diverse reward signals. We prove that this approach has an efficient algorithmic solution, and empirically demonstrate that GRPO with PODS using max-variance down-sampling achieves superior performance over standard GRPO on the GSM8K benchmark.
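The abstract states that max-variance down-sampling admits an efficient algorithm. One way such efficiency can arise for scalar rewards is if the variance-maximizing subset always mixes the largest and smallest rewards, so that only the split point needs to be searched after sorting. The sketch below assumes that structure (the function name and this reduction are illustrative, not taken from the paper):

```python
def max_variance_subset(rewards, m):
    """Return indices of m rewards with (approximately) maximal variance.

    Sketch under the assumption that an optimal size-m subset consists of
    the j largest and m - j smallest rewards for some j; we scan all m + 1
    splits after sorting, giving O(n log n) time overall.
    """
    n = len(rewards)
    assert 0 < m <= n
    # Indices sorted by reward value, ascending.
    idx = sorted(range(n), key=lambda i: rewards[i])
    best, best_var = None, -1.0
    for j in range(m + 1):
        # Take m - j from the bottom and j from the top of the sorted order.
        chosen = idx[:m - j] + idx[n - j:]
        vals = [rewards[i] for i in chosen]
        mu = sum(vals) / m
        var = sum((v - mu) ** 2 for v in vals) / m
        if var > best_var:
            best_var, best = var, chosen
    return best
```

For example, with rewards `[0.0, 0.1, 0.5, 0.9, 1.0]` and `m = 2`, the selected rollouts are the extremes (indices 0 and 4), which matches the intuition that contrasting successes and failures carry the most learning signal for a GRPO-style update.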