🤖 AI Summary
GRPO exhibits an implicit bias in formal theorem proving: it over-rewards high-frequency correct proofs while suppressing low-probability yet valid proof paths, thereby limiting pass@$N$ performance. To address this, we propose an "unlikeliness reward" within the Group Relative Policy Optimization (GRPO) framework, which explicitly incentivizes rare but correct proofs. Combined with running multiple PPO epochs per round, this approach mitigates probability sharpening, in which policy gradients excessively concentrate probability mass on dominant modes, and enhances exploration of underrepresented solution regions. Implemented in the Lean theorem-proving environment, our method significantly improves solution-space diversity and the discovery of rare proof paths. Experiments on miniF2F-test demonstrate that our approach consistently outperforms standard GRPO across all pass@$N$ metrics, achieving performance on par with DeepSeek-Prover-V1.5-RL. The implementation is fully open-sourced and reproducible.
📝 Abstract
Reinforcement learning has emerged as an effective framework for training large language models on structured language-conditioned tasks. We identify a critical flaw in Group Relative Policy Optimization (GRPO), a widely used RL algorithm in this setting. For tasks that require multi-sample performance, such as formal theorem proving, GRPO is biased toward reinforcing already probable solutions while neglecting rare but correct proofs. This implicit bias impairs performance on pass@$N$ metrics at large sample sizes, limiting the algorithm's practicality for training theorem provers. To address this, we introduce the unlikeliness reward, a straightforward method that explicitly encourages reinforcing rare correct solutions. Additionally, we find that increasing the number of PPO epochs further mitigates this bias. Our experiments confirm that incorporating the unlikeliness reward significantly improves pass@$N$ across a large range of $N$, outperforming standard GRPO and substantially increasing sample diversity. Applying our revised recipe to Lean, we achieve performance competitive with DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation, providing a simple yet effective recipe for training formal theorem provers with RL.
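To make the idea concrete, here is a minimal sketch of group-relative advantage computation with an added unlikeliness bonus. The exact shaping function (`grpo_advantages_with_unlikeliness`, the linear rank-based bonus, and the `bonus` weight) is an illustrative assumption, not the paper's formula; it only shows the general pattern of upweighting correct samples that the current policy assigns low probability, before the standard GRPO within-group normalization.

```python
import numpy as np

def grpo_advantages_with_unlikeliness(rewards, logprobs, bonus=0.5):
    """Group-relative advantages with a hypothetical unlikeliness bonus.

    rewards:  binary correctness of each sampled proof in one group.
    logprobs: sequence log-probability of each sample under the policy.
    Correct samples that the policy ranks as unlikely receive a larger
    shaped reward; the linear rank scaling is an illustrative choice.
    """
    rewards = np.asarray(rewards, dtype=float)
    logprobs = np.asarray(logprobs, dtype=float)
    shaped = rewards.copy()
    correct = rewards > 0
    if correct.any():
        # Rank correct samples by likelihood: rank 0 = most likely.
        order = np.argsort(-logprobs[correct])
        ranks = np.empty_like(order)
        ranks[order] = np.arange(order.size)
        n = max(order.size - 1, 1)
        # Most likely correct proof gets no bonus; least likely gets
        # the full bonus, countering probability sharpening.
        shaped[correct] += bonus * ranks / n
    # Standard GRPO step: normalize rewards within the group.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)
```

Under this shaping, a correct but low-probability proof receives a larger advantage than an equally correct high-probability one, so the policy gradient no longer concentrates all mass on the dominant mode.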