Learning to Explore with Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses an exploration bottleneck in existing Reinforcement Learning with Verifiable Rewards (RLVR) methods for large language model reasoning: under large sampling budgets they struggle to discover novel strategies and instead merely reweight known solution paths. To overcome this limitation, the authors propose a Parameter-Space Noise (PSN) mechanism that enables trajectory-level, temporally consistent exploration, and they integrate truncated importance sampling to mitigate the mismatch between the sampling policy and policy updates. They further introduce a lightweight adaptive noise scheduler driven by semantic diversity and normalized self-certainty, which maintains long-horizon reasoning coherence without requiring expensive KL-divergence computations. The approach is orthogonal to existing RLVR frameworks and consistently improves pass-at-k performance across multiple mathematical reasoning benchmarks and model families, outperforming current exploration-oriented methods.
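The adaptive noise scheduler described in the summary can be illustrated with a minimal sketch. The paper does not give its exact update rule here, so the function below (`adapt_noise_scale`, its thresholds, and the multiplicative `factor`) is a hypothetical stand-in: it widens the noise scale when sampled rollouts are semantically too similar while the model remains confident, and shrinks it otherwise to preserve reasoning coherence.

```python
def adapt_noise_scale(sigma, diversity, certainty,
                      diversity_target=0.5, certainty_floor=0.3,
                      factor=1.05):
    """Hypothetical multiplicative noise scheduler (illustrative only).

    diversity: semantic diversity of the sampled rollouts, in [0, 1]
    certainty: normalized self-certainty of the model, in [0, 1]

    If rollouts are too similar (low diversity) and the model is still
    confident, widen the parameter noise to push exploration; otherwise
    shrink it so long chains of thought stay coherent.
    """
    if diversity < diversity_target and certainty > certainty_floor:
        return sigma * factor
    return sigma / factor
```

Both signals are cheap scalars computed from the rollouts themselves, which is what lets the scheduler avoid KL-divergence estimation between the perturbed and unperturbed policies.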

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
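The two core mechanisms in the abstract, perturbing parameters once before each rollout and reweighting the update with truncated importance sampling, can be sketched as follows. This is not the authors' implementation; the function names, the isotropic Gaussian noise, and the clip value are illustrative assumptions.

```python
import numpy as np

def perturb_parameters(params, sigma, rng):
    """Return a noisy copy of the policy parameters.

    The same perturbed copy is used for an entire rollout, so the
    exploration is consistent across all tokens of a trajectory
    (unlike per-step action-space noise).
    """
    return {name: p + rng.normal(0.0, sigma, size=p.shape)
            for name, p in params.items()}

def truncated_is_weights(logp_policy, logp_behavior, clip=2.0):
    """Per-token truncated importance-sampling (TIS) ratios.

    Ratios pi_policy / pi_behavior are clipped from above so tokens
    that are rare under the perturbed (behavior) policy cannot blow
    up the gradient of the unperturbed update policy.
    """
    ratio = np.exp(np.asarray(logp_policy) - np.asarray(logp_behavior))
    return np.minimum(ratio, clip)
```

A training step would perturb the parameters, sample a group of rollouts with the perturbed policy, then apply the TIS weights to the per-token loss of the unperturbed policy before the GRPO update.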
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
exploration ceiling
reasoning strategies
large sampling budgets
trajectory-level exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

parameter-space noise
reinforcement learning with verifiable rewards
temporally consistent exploration
truncated importance sampling
adaptive noise scheduling
Authors
Bizhe Bai
College of Future Information Technology, Fudan University, Shanghai, China; Shanghai Innovation Institute, Shanghai, China
Xinyue Wang
College of Future Information Technology, Fudan University, Shanghai, China
Peng Ye
Shanghai AI Laboratory, Shanghai, China; The Chinese University of Hong Kong, Hong Kong, China
Tao Chen
Fudan University
Deep Learning; Medical Image Segmentation