Distilling LLM Feedback for Lean Theorem Proving

📅 2026-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses sparse rewards, limited exploration, and mode collapse in large language models trained with the GRPO algorithm for Lean4 theorem proving. The authors propose Feedback Distillation, which builds on token-level self-distillation: the model is trained to match, token by token, its own predictive distribution conditioned on privileged feedback produced by a language model. This provides token-level supervision and a channel for injecting external knowledge. In the Lean4 experiments, Feedback Distillation preserves more diversity in generated trajectories than GRPO, giving higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone.
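For readers unfamiliar with the pass@k metric mentioned above, the following is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), computed from n sampled proofs of which c are verified. The page does not say which estimator the authors use, so the function name `pass_at_k` and the choice of estimator are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the source does not confirm
    that the paper computes pass@k this way.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail
```

Under this estimator, a policy that spreads probability over several distinct proof strategies tends to gain more from a larger k than one that has collapsed onto a single strategy, which is why pass@k scaling is used as a proxy for diversity.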
📝 Abstract
Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
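The token-level matching objective described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the divergence or its direction, so the sketch assumes a per-token KL(teacher || student), where the "teacher" logits come from the same model run with the privileged feedback in its context and the "student" logits come from the model without it. All function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    lp = log_softmax(teacher_logits)  # model conditioned on feedback (assumed fixed target)
    lq = log_softmax(student_logits)  # model without feedback (being trained)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def feedback_distillation_loss(teacher_seq, student_seq):
    """Mean per-token KL over a trajectory.

    teacher_seq and student_seq are lists with one logit vector per token,
    aligned on the same generated tokens.
    """
    if len(teacher_seq) != len(student_seq) or not teacher_seq:
        raise ValueError("sequences must be non-empty and of equal length")
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)
```

Because the loss is computed at every token position, each step of a trajectory receives a training signal, in contrast to a single scalar reward per proof attempt.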
Problem

Research questions and friction points this paper is trying to address.

sparse rewards
limited exploration
mode collapse
theorem proving
reasoning models
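
The sparse-reward problem listed above can be made concrete with a small sketch of GRPO-style group-relative advantages, where each sampled proof's reward is normalized by the mean and standard deviation of its group. The exact normalization used in the paper is not stated on this page; the population standard deviation and the `eps` constant below are assumptions, and the function name is hypothetical.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same problem.

    Assumption: advantage_i = (r_i - mean) / (std + eps), with the
    population standard deviation of the group.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# A group where every proof attempt fails verification: all advantages
# are zero, so this problem contributes no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A group with one verified proof: only here does the policy get a signal.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

With binary verifier rewards, hard theorems often produce groups in which no sample succeeds, so the update carries no information about which tokens were closer to a valid proof.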
Innovation

Methods, ideas, or system contributions that make the work stand out.

Feedback Distillation
token-level supervision
self-distillation
theorem proving
post-training