🤖 AI Summary
This work targets the exploration bottleneck in existing Reinforcement Learning with Verifiable Rewards (RLVR) methods for large language model reasoning, which tend to reweight known solution paths rather than discover novel strategies, a limitation that becomes most apparent under large sampling budgets. To overcome it, the authors propose a Parameter Space Noise (PSN) mechanism that enables trajectory-level consistent exploration, integrating truncated importance sampling to mitigate the mismatch between the sampling policy and the updated policy. They further introduce a lightweight adaptive noise scheduler driven by semantic diversity and normalized self-certainty, which preserves long-horizon reasoning coherence without requiring expensive KL-divergence computations. The proposed approach is orthogonal to existing RLVR frameworks and consistently improves pass-at-k performance across multiple mathematical reasoning benchmarks and model families, outperforming current exploration-oriented methods.
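The adaptive scheduler described above can be pictured as a simple feedback loop: when rollouts look too homogeneous and the model too certain, increase the noise scale; otherwise decay it. The sketch below is illustrative only — the surrogate combination, target, learning rate, and clipping bounds are assumptions, not the paper's exact formula; only the inputs (semantic diversity, normalized self-certainty) come from the summary.

```python
def adapt_sigma(sigma, diversity, certainty, target=0.5, lr=0.1,
                lo=0.01, hi=1.0):
    """Multiplicatively adjust the parameter-noise scale toward a target
    exploration level. `diversity` (semantic diversity among rollouts) and
    `certainty` (normalized self-certainty) are assumed to lie in [0, 1];
    the equal-weight surrogate and target value are illustrative choices.
    """
    # Low diversity and high certainty both signal under-exploration.
    surrogate = 0.5 * diversity + 0.5 * (1.0 - certainty)
    if surrogate < target:
        sigma *= (1.0 + lr)   # under-exploring: inject more parameter noise
    else:
        sigma *= (1.0 - lr)   # exploring enough: anneal the noise
    return max(lo, min(hi, sigma))

# Homogeneous, over-confident rollouts -> the noise scale grows.
sigma = adapt_sigma(0.3, diversity=0.2, certainty=0.9)
assert sigma > 0.3
```

The appeal of such a surrogate is that it needs only quantities already available from the sampled rollouts, avoiding the extra forward passes that a KL-based controller would require.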
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that preserves long-horizon chain-of-thought coherence better than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets. It outperforms prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal to them, and is thus composable for additional gains.
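The two core mechanisms can be sketched on a toy policy: perturb the parameters once, sample an entire trajectory with the perturbed policy (trajectory-level consistency), then cap the per-action importance ratio between the current and perturbed policies. This is a minimal illustration, not the paper's implementation — the logit-vector "policy", noise scale, and truncation threshold are all assumed stand-ins for an LLM's parameters and the method's tuned values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Toy "policy": logits over a small action vocabulary (stand-in for LLM params).
theta = np.array([1.0, 0.5, -0.5, 0.0])

# Parameter Space Noise: perturb the parameters ONCE, then keep the perturbed
# policy fixed for the whole rollout, so the exploration bias is temporally
# consistent across the trajectory.
sigma = 0.3  # noise scale (illustrative; adapted online in the method)
theta_noisy = theta + rng.normal(0.0, sigma, size=theta.shape)

T = 5
actions = [rng.choice(len(theta), p=softmax(theta_noisy)) for _ in range(T)]

# Truncated Importance Sampling: correct for the mismatch between the
# behavior (perturbed) policy and the current policy, capping each ratio at C.
C = 2.0  # truncation threshold (illustrative value)
p_cur, p_beh = softmax(theta), softmax(theta_noisy)
ratios = [min(p_cur[a] / p_beh[a], C) for a in actions]
assert all(0.0 < r <= C for r in ratios)
```

Because the noise lives in parameter space rather than in the per-token sampling distribution, every step of the rollout is drawn from the same perturbed policy, which is what lets the exploration stay coherent over long chains of thought.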