Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

📅 2026-05-28
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing GRPO methods increase rollout diversity mainly through token-level stochasticity, which can introduce step-wise noise and compromise trajectory coherence. This work proposes S2L-PO, a framework that uses a fixed, smaller model from the same family as a policy-level explorer to generate logically consistent yet diverse rollouts. A progressive annealing mechanism then shifts rollout generation from the small model to the large target model, balancing exploration and exploitation. The approach avoids mid-training performance degradation and improves accuracy on mathematical reasoning benchmarks (for instance, +8.8% on AIME 2024 for an 8B model guided by a 1.7B explorer) while also reducing the computational cost of rollouts.
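The progressive annealing idea can be illustrated with a small schedule. The summary does not state the actual schedule used by S2L-PO, so the following is only a minimal sketch under assumptions: a linear decay, a per-group split between explorer and learner rollouts, and the hypothetical names `small_model_fraction`, `split_group`, and `anneal_end`.

```python
def small_model_fraction(step: int, total_steps: int, anneal_end: float = 0.5) -> float:
    """Share of rollouts taken from the small explorer at a given training step.

    Assumption (not from the paper): the share decays linearly from 1.0 at
    step 0 to 0.0 once `anneal_end * total_steps` steps have elapsed.
    """
    cutoff = anneal_end * total_steps
    if cutoff <= 0:
        return 0.0
    return max(0.0, 1.0 - step / cutoff)


def split_group(group_size: int, step: int, total_steps: int,
                anneal_end: float = 0.5) -> tuple[int, int]:
    """Return (n_small, n_large): how many rollouts in one group come from
    the small explorer versus the large learner's own sampling."""
    frac = small_model_fraction(step, total_steps, anneal_end)
    n_small = round(frac * group_size)
    return n_small, group_size - n_small


if __name__ == "__main__":
    # Print the split at a few points of a hypothetical 100-step run.
    for step in (0, 25, 50, 75):
        print(step, split_group(8, step, 100))
```

Under this illustrative schedule, early groups consist entirely of small-model rollouts and late groups entirely of the learner's own samples; other decay shapes (cosine, stepwise) would fit the same description.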
📝 Abstract
We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
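Two quantities in the abstract can be made concrete: pass@k, which is used as the indicator of policy-level diversity, and the group-relative advantage that makes GRPO depend on diverse rollouts. The sketch below uses the widely used unbiased pass@k estimator and the standard GRPO reward normalization; the function names, the `eps` term, and the choice of population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from math import comb
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct:
    1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each reward standardized within its group.
    Uses the population standard deviation (an assumption; some
    implementations use the sample standard deviation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(pass_at_k(4, 1, 1), pass_at_k(4, 1, 2))
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group with identical rewards yields zero advantages, hence no signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

The last call shows why rollout diversity matters for GRPO: when every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.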
Problem

Research questions and friction points this paper is trying to address.

policy-level diversity
rollout diversity
Group Relative Policy Optimization
exploration
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

policy-level diversity
small-to-large policy optimization
structured exploration
progressive annealing
GRPO