Evaluation of LLMs for Mathematical Formalization in Lean

📅 2026-06-03
🤖 AI Summary
This study compares how well large language models generate formal mathematical proofs in Lean 4. Using subsets of the miniF2F and miniCTX datasets, the authors evaluate several prominent models, including Gemini, Claude, Nemotron, and GPT-OSS, with both the pass@$k$ and refine@$k$ metrics, and they also report the cost of each model. Gemini 3.1 Pro achieves a refine@32 success rate of 92% on miniF2F, while Claude Opus 4.7 reaches 86% on miniCTX. Nemotron 3 Super and GPT-OSS 120B are the most cost-efficient, with competitive accuracy at an average cost below \$0.01 per correct proof. By reporting cost alongside accuracy, the comparison is aimed at practitioners choosing a model to support their own formalization projects.
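The cost figure quoted above can be read as total spend divided by the number of verified proofs. The sketch below illustrates that arithmetic only; the function name and the example numbers are hypothetical, and the paper's exact cost accounting (for instance, whether failed attempts are billed in) is an assumption here.

```python
def cost_per_correct_proof(total_cost_usd: float, num_correct: int) -> float:
    """Average spend per verified proof.

    Assumption: total_cost_usd covers every attempt, including failed ones,
    so models that need many retries are penalized.
    """
    if total_cost_usd < 0:
        raise ValueError("total_cost_usd must be non-negative")
    if num_correct <= 0:
        # No verified proof: the cost per correct proof is unbounded.
        return float("inf")
    return total_cost_usd / num_correct


# Illustrative numbers only, not results from the paper.
example = cost_per_correct_proof(total_cost_usd=0.90, num_correct=120)
print(f"${example:.4f} per correct proof")
```

Under this reading, a model with lower accuracy can still come out ahead if its attempts are cheap enough.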
📝 Abstract
Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We provide a comparison of various LLMs' effectiveness in producing formal proofs in Lean 4 with the goal of assisting those seeking to use LLMs to support their own projects. We use both pass@$k$ and refine@$k$ metrics as the basis for our comparison and evaluate on subsets of both the miniF2F and miniCTX datasets. Our testing shows that overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best. Gemini 3.1 Pro achieved a 92% success rate on miniF2F via refine@32, whereas Opus 4.7 achieved an 86% success rate on miniCTX via refine@32. When taking cost into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs of $<\$0.01$ per correct proof.
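For readers unfamiliar with pass@$k$, a minimal sketch follows. It uses the unbiased estimator popularized by code-generation benchmarks: given $n$ sampled proofs of which $c$ pass the Lean checker, pass@$k$ is estimated as $1 - \binom{n-c}{k} / \binom{n}{k}$. This is an assumption about how the metric might be computed, not a description of the authors' evaluation code, and the function names are hypothetical. refine@$k$ is not sketched because its definition is not given here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given n total samples of which c were verified correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When exactly $k$ samples are drawn per problem ($n = k$), the estimate reduces to whether any sample was correct, so the benchmark score is simply the fraction of problems solved.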
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Mathematical Formalization
Lean 4
Formal Proofs
Model Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

formal proof generation
Lean 4
refine@k metric
cost-efficiency evaluation
large language models
Authors
Tyson Klingner
Math AI Lab, University of Washington, Seattle, WA, USA
Drew Bladek
Math AI Lab, University of Washington, Seattle, WA, USA
Escher Crawford
Math AI Lab, University of Washington, Seattle, WA, USA
Bohao Chen
Math AI Lab, University of Washington, Seattle, WA, USA
Ariel Fu
Math AI Lab, University of Washington, Seattle, WA, USA
Kaira Nair
Math AI Lab, University of Washington, Seattle, WA, USA
Jarod Alper
Senior Lecturer of Mathematics, Australian National University
Algebraic Geometry
Giovanni Inchiostro
Math AI Lab, University of Washington, Seattle, WA, USA
Vasily Ilin
University of Washington
Sampling, Neural Networks, Landau Equation