Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RL with Verifiable Rewards (RLVR) methods, such as PPO and GRPO, suffer from training instability, diversity collapse, and heavy dependence on manual hyperparameter tuning when used to enhance LLMs' mathematical reasoning. Method: This paper proposes a radically simplified RL paradigm, Random Policy Valuation for Diverse Reasoning (ROVER), built on evaluating a fixed uniformly random policy. Its core insight is that the optimal action can be recovered directly from the Q-function induced by this uniform policy, bypassing generalized policy iteration and thereby eliminating policy degradation and complex optimization machinery. The problem is formalized as a finite-horizon MDP with deterministic state transitions, tree-structured dynamics, and binary terminal rewards; actions are sampled via a softmax over the uniform policy's Q-values. Results: Across multiple mathematical reasoning benchmarks, ROVER improves pass@1 by +8.2 and pass@256 by +16.8, along with a +17.6% gain in reasoning diversity, without heuristic tricks or fine-grained hyperparameter tuning.
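The core claim can be sketched on a toy instance of the MDP class the paper describes: in a tree with deterministic transitions and binary terminal rewards, the uniform policy's Q-value of an action equals the probability that a random continuation succeeds, so any action with maximal Q leads to a subtree containing a correct leaf. Acting greedily with respect to these Q-values therefore recovers an optimal trajectory without any policy iteration. The horizon, action set, and "success leaves" below are hypothetical choices for illustration, not from the paper:

```python
# Toy finite-horizon tree MDP with deterministic transitions and binary
# terminal rewards (the structure assumed for math RLVR in the paper).
# A state is the tuple of actions taken so far; each non-terminal state
# has two actions (0, 1); horizon H = 3. A leaf pays reward 1 only if it
# is one of the (hypothetical) correct trajectories.
H = 3
SUCCESS_LEAVES = {(0, 1, 1), (1, 0, 0)}  # illustrative "correct answers"

def reward(leaf):
    return 1.0 if leaf in SUCCESS_LEAVES else 0.0

def v_uniform(state):
    """Value of the uniformly random policy: average over both actions."""
    if len(state) == H:
        return reward(state)
    return 0.5 * sum(v_uniform(state + (a,)) for a in (0, 1))

def q_uniform(state, action):
    """Q-value of the uniform policy; transitions are deterministic,
    so Q(s, a) is just the uniform-policy value of the successor."""
    return v_uniform(state + (action,))

# Greedy action selection w.r.t. the uniform-policy Q-function reaches a
# reward-1 leaf whenever one is reachable -- no policy improvement loop.
state = ()
while len(state) < H:
    state += (max((0, 1), key=lambda a: q_uniform(state, a)),)

print(state, reward(state))  # a success leaf
```

The key property this exercises: since Q under the uniform policy is the success probability of random continuation, it is strictly positive exactly on subtrees that contain a correct leaf, so the greedy choice never enters a dead subtree.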

📝 Abstract
RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy's value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both quality (+8.2 on pass@1, +16.8 on pass@256) and diversity (+17.6%), despite its radical simplification compared to strong, complicated existing methods.
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM reasoning with verifiable rewards
Addressing training instability in policy optimization
Enhancing solution diversity in mathematical reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses uniform random policy Q-values for action selection
Simplifies RL by bypassing generalized policy iteration loop
Maintains solution diversity through softmax sampling
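The diversity mechanism listed above is plain softmax sampling over Q-values: instead of always taking the argmax, actions are drawn in proportion to exp(Q/temperature), so alternative valid reasoning paths keep nonzero probability. A minimal sketch, assuming a temperature parameter (the paper's exact parameterization is not given in this summary):

```python
import math
import random

def softmax_sample(q_values, temperature=1.0, rng=random):
    """Sample an action index with probability proportional to
    exp(Q / temperature). Sampling rather than taking the argmax keeps
    alternative high-Q actions alive, which is how solution diversity
    is preserved. The temperature default is illustrative."""
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the categorical distribution.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

With two actions whose Q-values differ by 1 and temperature 1, the better action is chosen with probability e/(1+e), roughly 73%, so the weaker path is still explored about a quarter of the time.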