🤖 AI Summary
GRPO exhibits an implicit bias in formal theorem proving: it over-rewards high-frequency correct proofs while suppressing low-probability yet valid proof paths, thereby limiting pass@$N$ performance. To address this, we propose an "unlikeliness reward" within the Group Relative Policy Optimization (GRPO) framework, which explicitly incentivizes rare but correct proofs. Combined with running multiple PPO epochs per round, this approach mitigates probability sharpening, in which policy gradients excessively concentrate probability mass on dominant modes, and enhances exploration of underrepresented solution regions. Implemented in the Lean theorem-proving environment, our method significantly improves solution-space diversity and the discovery of rare proof paths. Experiments on miniF2F-test demonstrate that our approach consistently outperforms standard GRPO across all pass@$N$ metrics, achieving performance on par with DeepSeek-Prover-V1.5-RL. The implementation is fully open-sourced and reproducible.
📝 Abstract
Reinforcement learning has emerged as an effective framework for training large language models on structured language-conditioned tasks. We identify a critical flaw in Group Relative Policy Optimization (GRPO), a widely used RL algorithm in this setting. For tasks that require multi-sample performance, such as formal theorem proving, GRPO is biased toward reinforcing already probable solutions while neglecting rare but correct proofs. This implicit bias impairs performance on pass@$N$ metrics at large sample sizes, limiting the algorithm's practicality for training theorem provers. To address this, we introduce the unlikeliness reward, a straightforward method that explicitly encourages reinforcing rare correct solutions. Additionally, we find that increasing the number of PPO epochs further mitigates this bias. Our experiments confirm that incorporating the unlikeliness reward significantly improves pass@$N$ across a large range of $N$, outperforming standard GRPO and substantially increasing sample diversity. Applying our revised recipe to Lean, we achieve performance competitive with DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation, providing a simple yet effective recipe for training formal theorem provers with RL.
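To make the idea concrete, here is a minimal sketch of group-relative advantage computation with an added unlikeliness bonus. The exact shaping function (`grpo_advantages_with_unlikeliness`, the linear rank-based bonus, and the `bonus` weight) is an illustrative assumption, not the paper's formula; it only shows the general pattern of upweighting correct samples that the current policy assigns low probability, before the standard GRPO within-group normalization.

```python
import numpy as np

def grpo_advantages_with_unlikeliness(rewards, logprobs, bonus=0.5):
    """Group-relative advantages with a hypothetical unlikeliness bonus.

    rewards:  binary correctness of each sampled proof in one group.
    logprobs: sequence log-probability of each sample under the policy.
    Correct samples that the policy ranks as unlikely receive a larger
    shaped reward; the linear rank scaling is an illustrative choice.
    """
    rewards = np.asarray(rewards, dtype=float)
    logprobs = np.asarray(logprobs, dtype=float)
    shaped = rewards.copy()
    correct = rewards > 0
    if correct.any():
        # Rank correct samples by likelihood: rank 0 = most likely.
        order = np.argsort(-logprobs[correct])
        ranks = np.empty_like(order)
        ranks[order] = np.arange(order.size)
        n = max(order.size - 1, 1)
        # Most likely correct proof gets no bonus; least likely gets
        # the full bonus, countering probability sharpening.
        shaped[correct] += bonus * ranks / n
    # Standard GRPO step: normalize rewards within the group.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)
```

Under this shaping, a correct but low-probability proof receives a larger advantage than an equally correct high-probability one, so the policy gradient no longer concentrates all mass on the dominant mode.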