🤖 AI Summary
Existing formal theorem provers perform well on high-school and competition-level mathematics but generalize poorly to university-level problems. This work introduces REAL-Prover, a new open-source, stepwise Lean 4 theorem prover aimed at more advanced mathematics. It rests on three components: (1) REAL-Prover-v1, a fine-tuned large language model, paired with the retrieval system Leansearch-PS; (2) HERALD-AF, a data extraction pipeline that converts natural-language math problems into formal Lean 4 statements; and (3) Jixia-interactive, a new open-source Lean 4 interactive environment used for synthetic data collection. Using supervised fine-tuning only, REAL-Prover reaches 23.7% Pass@64 on ProofNet, comparable to state-of-the-art models, and sets a new SOTA of 56.7% Pass@64 on FATE-M, a newly introduced benchmark of algebraic problems.
📝 Abstract
Formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 that pushes this boundary. The prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural-language math problems into formal statements, and Jixia-interactive, a new open-source Lean 4 interactive environment that facilitates synthetic data collection. In our experiments, our prover, using only supervised fine-tuning, achieves a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, on which our prover achieves a SOTA success rate of 56.7% (Pass@64).
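To make the reported metric concrete, here is a minimal Python sketch of how a Pass@k success rate is commonly computed: a problem counts as solved if at least one of its first k proof attempts is accepted by the Lean checker. The function name `pass_at_k`, the theorem names, and the toy results are all hypothetical illustrations, and the paper's exact evaluation protocol may differ.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of problems with at least one verified proof in the first k attempts.

    `attempts` maps a problem id to a list of booleans, one per proof attempt,
    where True means the proof was accepted by the checker. This data layout is
    an assumption made for illustration, not the paper's format.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not attempts:
        raise ValueError("no problems to evaluate")
    # A problem is solved if any of its first k attempts succeeded.
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Toy data (made up): three problems, four attempts each.
toy_results = {
    "thm_a": [False, True, False, False],
    "thm_b": [False, False, False, False],
    "thm_c": [True, False, False, False],
}

if __name__ == "__main__":
    for k in (1, 2, 4):
        print(f"Pass@{k} = {pass_at_k(toy_results, k):.3f}")
```

Under this definition, a 23.7% Pass@64 means that for 23.7% of the benchmark problems at least one of 64 attempts produced a proof that checked.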