MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
The fragmentation between interactive and automated theorem-proving tools hinders progress in formal mathematical verification. Method: This work bridges that gap by porting the miniF2F benchmark to the auto-active verifier Dafny, the first such adaptation, and introduces a collaborative paradigm in which large language models (LLMs) generate high-level proof sketches while Dafny performs low-level, fully automated verification. The authors evaluate 12 open-source LLMs using iterative error correction against an empty-proof baseline. Contribution/Results: The miniF2F-Dafny benchmark establishes a new evaluation framework for LLM-guided formal reasoning. Empty proofs alone verify 40.6% of the test set; with LLM-generated proof hints and iterative error correction, the best model reaches a 55.7% pass@4 success rate. This work extends miniF2F's applicability, empirically demonstrates the efficacy of LLM–formal-verifier collaboration, and opens a scalable, low-barrier path toward automating rigorous mathematical reasoning.

📝 Abstract
We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs, requiring no manual proof steps. For problems where empty proofs fail, we evaluate 12 off-the-shelf LLMs on providing proof hints. The best model we test achieves a 55.7% pass@4 success rate when employing iterative error correction. These preliminary results highlight an effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F.
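For context, an "empty proof" is a theorem stated with no proof body, leaving verification entirely to Dafny's SMT-backed automation. A minimal illustrative sketch (not taken from the benchmark):

```dafny
// A simple lemma that Dafny's automation (backed by the Z3
// SMT solver) discharges with an empty body: no manual steps.
lemma SquareNonNegative(x: int)
  ensures x * x >= 0
{
  // Empty proof: the verifier succeeds on its own.
}
```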
Problem

Research questions and friction points this paper is trying to address.

How can mathematical reasoning benchmarks built for interactive provers be translated to an automated theorem prover?
Can LLMs provide useful proof hints when automation alone fails?
How should labor be divided between AI guidance and automated verification?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Translation of the miniF2F benchmark to the Dafny automated theorem prover
Using LLMs to provide proof hints for problems automation cannot solve alone
Iterative error correction by LLMs to improve success rates
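When the empty proof fails, the hints an LLM supplies are typically intermediate assertions or lemma calls inserted into the proof body. A hypothetical sketch of what such a hint-augmented proof might look like (the function and lemma names are illustrative, not from the benchmark):

```dafny
function Sum(n: nat): nat {
  if n == 0 then 0 else Sum(n - 1) + n
}

// Gauss's formula: the empty body may fail, but two LLM-suggested
// hints (a recursive lemma call and an unfolding assertion)
// can let Dafny's automation finish the proof.
lemma SumFormula(n: nat)
  ensures 2 * Sum(n) == n * (n + 1)
{
  if n > 0 {
    SumFormula(n - 1);                 // hint: invoke the inductive hypothesis
    assert Sum(n) == Sum(n - 1) + n;   // hint: unfold one recursion step
  }
}
```

If verification fails, Dafny's error messages (e.g., which `ensures` clause could not be proved) are fed back to the LLM, which revises its hints; repeating this up to four times yields the pass@4 metric reported above.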