🤖 AI Summary
This study compares how effectively large language models generate formal mathematical proofs in Lean 4. Using subsets of the miniF2F and miniCTX datasets, the authors evaluate several prominent models (including Gemini, Claude, Nemotron, and GPT-OSS) with both the pass@$k$ and refine@$k$ metrics, and additionally compare the models by cost. Gemini 3.1 Pro achieves a refine@32 success rate of 92% on miniF2F, while Claude Opus 4.7 reaches 86% on miniCTX. Nemotron 3 Super and GPT-OSS 120B are the most cost-efficient, combining competitive accuracy with an average cost of less than \$0.01 per correct proof. The comparison is intended to help practitioners choose an LLM to support their own Lean 4 projects.
📝 Abstract
Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved dramatically. We compare the effectiveness of various LLMs at producing formal proofs in Lean 4, with the goal of assisting those seeking to use LLMs to support their own projects. We use both the pass@$k$ and refine@$k$ metrics as benchmarks for our comparison and evaluate on subsets of both the miniF2F and miniCTX datasets. Our testing shows that, overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best. Gemini 3.1 Pro achieved a 92\% success rate on miniF2F via refine@32, whereas Opus 4.7 achieved an 86\% success rate on miniCTX via refine@32. When cost is taken into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs of $<\$0.01$ per correct proof.
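For readers unfamiliar with the metrics, the sketch below shows one common way to compute pass@$k$ and a cost-per-correct-proof figure. The abstract does not say how the authors compute either quantity, so this is an assumption: `pass_at_k` uses the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples with $c$ correct, and `cost_per_correct_proof` simply divides total spend by the number of verified proofs. Both function names are hypothetical, and refine@$k$ is not sketched because its definition is not given here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of proof attempts sampled
    c: number of those attempts that the proof checker accepted
    k: attempt budget being evaluated (1 <= k <= n)

    Assumption: this is the standard estimator, not necessarily
    the exact procedure used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect attempts: any k-subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random k-subset contains at least one correct attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def cost_per_correct_proof(total_cost_usd: float, num_correct: int) -> float:
    """Average spend per verified proof (hypothetical helper).

    Returns infinity when no proof was verified.
    """
    if num_correct < 0:
        raise ValueError("num_correct must be non-negative")
    if num_correct == 0:
        return float("inf")
    return total_cost_usd / num_correct
```

A benchmark-level pass@$k$ score would then be the mean of `pass_at_k` over all problems in the evaluated subset.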