InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems

📅 2024-10-21

🏛️ arXiv.org

📈 Citations: 58

✨ Influential: 9

🤖 AI Summary

To address limitations of large language models (LLMs) in formal theorem proving, specifically their neglect of preference information in tactic trajectories and their difficulty exploring deep proof paths, this paper proposes a critic-guided expert iteration framework. The authors run expert iteration over the large-scale Lean-workbook problem set using more than 20,000 CPU days, observe log-linear trends relating the number of solved problems to proof length and CPU usage, and train a critic model that selects relatively easy problems for the policy model to attempt and guides the search toward deeper proofs. The method combines a fine-tuned InternLM2.5 policy, formal verification in LEAN, and iterative self-training on the proofs it discovers. InternLM2.5-StepProver reaches a 65.9% pass rate on MiniF2F-test and proves or disproves 17.0% of Lean-Workbook-Plus, up 7.5 percentage points from the 9.5% proved when that dataset was released, and it reports open-source state-of-the-art results on ProofNet and the Putnam benchmark. The models and searched proofs are publicly released.

📝 Abstract

Large Language Models (LLMs) have emerged as powerful tools in mathematical theorem proving, particularly when utilizing formal languages such as LEAN. The major learning paradigm is expert iteration, which requires a pre-defined dataset comprising numerous mathematical problems. In this process, LLMs attempt to prove problems within the dataset and iteratively refine their capabilities through self-training on the proofs they discover. We propose using the large-scale LEAN problem dataset Lean-workbook for expert iteration, with more than 20,000 CPU days. During expert iteration, we found log-linear trends relating the number of solved problems to proof length and CPU usage. We train a critic model to select relatively easy problems for policy models to attempt and to guide the model to search for deeper proofs. InternLM2.5-StepProver achieves open-source state-of-the-art results on the MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam benchmarks. Specifically, it achieves a pass rate of 65.9% on MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus, a significant improvement over the 9.5% of problems proved when Lean-Workbook-Plus was released. We open-source our models and searched proofs at https://github.com/InternLM/InternLM-Math and https://huggingface.co/datasets/internlm/Lean-Workbook.
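
The expert iteration loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: ToyPolicy, critic_score, the integer "difficulty" values, and the rule that skill grows by one per new proof are all hypothetical stand-ins for the LLM prover, the trained critic, LEAN verification, and fine-tuning. The sketch only shows the control flow: the critic ranks unsolved problems, the policy attempts the easiest ones within a budget, and the policy is then "trained" on whatever it newly proved.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for the prover: solves anything up to its skill."""
    skill: int = 1

    def try_prove(self, difficulty: int) -> bool:
        # A real system would run proof search and check the result in LEAN.
        return difficulty <= self.skill

    def train_on(self, new_proofs: int) -> None:
        # Assumption: each newly found proof raises skill by one (toy self-training).
        self.skill += new_proofs


def critic_score(difficulty: int) -> float:
    """Hypothetical critic: a higher score means the problem looks easier."""
    return -float(difficulty)


def expert_iteration(difficulties: Dict[str, int], rounds: int, budget: int) -> Dict[str, int]:
    """Return a mapping from problem name to the round in which it was solved."""
    policy = ToyPolicy()
    solved: Dict[str, int] = {}
    for r in range(rounds):
        unsolved = [name for name in difficulties if name not in solved]
        # The critic puts the problems it judges easiest first.
        unsolved.sort(key=lambda name: critic_score(difficulties[name]), reverse=True)
        new_proofs = 0
        for name in unsolved[:budget]:
            if policy.try_prove(difficulties[name]):
                solved[name] = r
                new_proofs += 1
        # Self-training step on the proofs discovered this round.
        policy.train_on(new_proofs)
    return solved


if __name__ == "__main__":
    print(expert_iteration({"a": 1, "b": 2, "c": 3, "d": 10}, rounds=3, budget=2))
```

In this toy setting, spending the per-round budget on the problems the critic ranks as easiest lets the policy pick up one new proof per round, while the hardest problem stays unsolved within three rounds.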

Problem

Research questions and friction points this paper is trying to address.

Improving automated theorem proving via critic-guided search

Capturing preference information from existing tactic trajectories

Boosting proof search performance through expert iteration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a critic model to guide the prover's search

Applies large-scale expert iteration for fine-tuning

Combines prover and critic models in one theorem-proving framework
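
One common way to let a critic guide a search is best-first expansion, sketched below on a toy problem. This is an assumption for illustration, not a description of the paper's search procedure: the integer "proof states", the two "tactics" (inc and double), and the distance-based critic are all hypothetical. The point is only that the frontier is ordered by critic score, so the state the critic rates highest is expanded next.

```python
import heapq
from typing import Callable, Dict, List, Optional


def best_first_search(
    start: int,
    goal: int,
    tactics: Dict[str, Callable[[int], int]],
    critic: Callable[[int, int], float],
    max_expansions: int = 1000,
) -> Optional[List[str]]:
    """Return a list of tactic names leading from start to goal, or None."""
    counter = 0  # tie-breaker so the heap never has to compare paths
    # heapq is a min-heap, so the critic score is negated.
    frontier = [(-critic(start, goal), counter, start, [])]
    seen = {start}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        expansions += 1
        for name, apply_tactic in tactics.items():
            nxt = apply_tactic(state)
            if nxt in seen:
                continue
            seen.add(nxt)
            counter += 1
            heapq.heappush(frontier, (-critic(nxt, goal), counter, nxt, path + [name]))
    return None


# Hypothetical tactics and critic for the toy state space.
TOY_TACTICS = {"inc": lambda s: s + 1, "double": lambda s: s * 2}


def toy_critic(state: int, goal: int) -> float:
    # Higher is better: states closer to the goal score higher.
    return -abs(goal - state)


if __name__ == "__main__":
    print(best_first_search(1, 10, TOY_TACTICS, toy_critic))
```

The max_expansions cap bounds the work when no path exists; a real prover would need a comparable budget on search steps or wall-clock time.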

🔎 Similar Papers

Lean-STaR: Learning to Interleave Thinking and Proving