🤖 AI Summary
This work targets the exploration bottleneck in existing Reinforcement Learning with Verifiable Rewards (RLVR) methods for large language model reasoning, which tend to reweight known solution paths rather than discover novel strategies, a limitation that becomes most apparent under large sampling budgets. To overcome it, the authors propose a Parameter Space Noise (PSN) mechanism that enables trajectory-level consistent exploration, integrating truncated importance sampling to mitigate the mismatch between the sampling policy and the updated policy. They further introduce a lightweight adaptive noise scheduler driven by semantic diversity and normalized self-certainty, which preserves long-horizon reasoning coherence without requiring expensive KL-divergence computations. The proposed approach is orthogonal to existing RLVR frameworks and consistently improves pass-at-k performance across multiple mathematical reasoning benchmarks and model families, outperforming current exploration-oriented methods.
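The adaptive scheduler described above can be pictured as a simple feedback loop: when rollouts look too homogeneous and the model too certain, increase the noise scale; otherwise decay it. The sketch below is illustrative only — the surrogate combination, target, learning rate, and clipping bounds are assumptions, not the paper's exact formula; only the inputs (semantic diversity, normalized self-certainty) come from the summary.

```python
def adapt_sigma(sigma, diversity, certainty, target=0.5, lr=0.1,
                lo=0.01, hi=1.0):
    """Multiplicatively adjust the parameter-noise scale toward a target
    exploration level. `diversity` (semantic diversity among rollouts) and
    `certainty` (normalized self-certainty) are assumed to lie in [0, 1];
    the equal-weight surrogate and target value are illustrative choices.
    """
    # Low diversity and high certainty both signal under-exploration.
    surrogate = 0.5 * diversity + 0.5 * (1.0 - certainty)
    if surrogate < target:
        sigma *= (1.0 + lr)   # under-exploring: inject more parameter noise
    else:
        sigma *= (1.0 - lr)   # exploring enough: anneal the noise
    return max(lo, min(hi, sigma))

# Homogeneous, over-confident rollouts -> the noise scale grows.
sigma = adapt_sigma(0.3, diversity=0.2, certainty=0.9)
assert sigma > 0.3
```

The appeal of such a surrogate is that it needs only quantities already available from the sampled rollouts, avoiding the extra forward passes that a KL-based controller would require.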
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that preserves long-horizon chain-of-thought coherence better than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets. It outperforms prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal to them, and is thus composable for additional gains.
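The two core mechanisms can be sketched on a toy policy: perturb the parameters once, sample an entire trajectory with the perturbed policy (trajectory-level consistency), then cap the per-action importance ratio between the current and perturbed policies. This is a minimal illustration, not the paper's implementation — the logit-vector "policy", noise scale, and truncation threshold are all assumed stand-ins for an LLM's parameters and the method's tuned values.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Toy "policy": logits over a small action vocabulary (stand-in for LLM params).
theta = np.array([1.0, 0.5, -0.5, 0.0])

# Parameter Space Noise: perturb the parameters ONCE, then keep the perturbed
# policy fixed for the whole rollout, so the exploration bias is temporally
# consistent across the trajectory.
sigma = 0.3  # noise scale (illustrative; adapted online in the method)
theta_noisy = theta + rng.normal(0.0, sigma, size=theta.shape)

T = 5
actions = [rng.choice(len(theta), p=softmax(theta_noisy)) for _ in range(T)]

# Truncated Importance Sampling: correct for the mismatch between the
# behavior (perturbed) policy and the current policy, capping each ratio at C.
C = 2.0  # truncation threshold (illustrative value)
p_cur, p_beh = softmax(theta), softmax(theta_noisy)
ratios = [min(p_cur[a] / p_beh[a], C) for a in actions]
assert all(0.0 < r <= C for r in ratios)
```

Because the noise lives in parameter space rather than in the per-token sampling distribution, every step of the rollout is drawn from the same perturbed policy, which is what lets the exploration stay coherent over long chains of thought.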