🤖 AI Summary
The fragmentation between interactive and automated theorem proving tools hinders progress in formal mathematical verification.
Method: This work bridges the gap by porting the miniF2F benchmark to the automated verifier Dafny—the first such adaptation—and introducing a collaborative paradigm in which large language models (LLMs) generate high-level proof hints while Dafny performs low-level, fully automated verification. We evaluate 12 off-the-shelf LLMs using iterative error correction, against an empty-proof baseline.
Contribution/Results: Our miniF2F-Dafny benchmark establishes a new evaluation framework for LLM-guided formal reasoning. Experiments show a 40.6% pass rate on the test set with empty proofs; with LLM-generated proof hints, the best model achieves 55.7% pass@4. This work extends miniF2F’s applicability, empirically validates the feasibility and efficacy of LLM–formal-verifier collaboration, and opens a scalable, low-barrier pathway toward automating rigorous mathematical reasoning.
📝 Abstract
We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs, requiring no manual proof steps. For problems where empty proofs fail, we evaluate 12 off-the-shelf LLMs on providing proof hints. The best model we test achieves a 55.7% pass@4 success rate with iterative error correction. These preliminary results highlight an effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F.
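To illustrate the division of labor the abstract describes, here is a minimal, hypothetical Dafny sketch (not taken from the benchmark): one lemma that Dafny's SMT-backed automation discharges with an empty proof body, and one where an intermediate assertion serves as the kind of high-level hint an LLM might supply.

```dafny
// Discharged automatically: the empty body suffices because the
// SMT solver handles the linear arithmetic on its own.
lemma LinearArith(a: int, b: int)
  ensures 2 * (a + b) == 2 * a + 2 * b
{
}

// For harder (e.g. nonlinear) goals, a high-level hint such as an
// intermediate assertion can guide the solver; supplying hints like
// this is the role played by the LLM in the proposed paradigm.
lemma SquareExpand(x: int, y: int)
  ensures (x + y) * (x + y) == x * x + 2 * x * y + y * y
{
  assert (x + y) * (x + y) == x * (x + y) + y * (x + y);
}
```

The contrast between the two bodies is the point: when automation succeeds, no proof text is needed at all; when it does not, a short sketch at the right level of abstraction can be enough for the verifier to fill in the low-level details.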