🤖 AI Summary
The fragmentation between interactive and automated theorem proving tools hinders progress in formal mathematical verification.
Method: This work bridges the gap by porting the miniF2F benchmark to the automated verifier Dafny—the first such adaptation—and introducing a collaborative paradigm in which large language models (LLMs) generate high-level proof hints while Dafny performs low-level, fully automated verification. We evaluate 12 off-the-shelf LLMs using iterative error correction, against an empty-proof baseline.
Contribution/Results: Our miniF2F-Dafny benchmark establishes a new evaluation framework for LLM-guided formal reasoning. Experiments show a 40.6% pass rate on the test set with empty proofs; with LLM-generated proof hints, the best model achieves 55.7% pass@4. This work extends miniF2F’s applicability, empirically validates the feasibility and efficacy of LLM–formal-verifier collaboration, and opens a scalable, low-barrier pathway toward automating rigorous mathematical reasoning.
📝 Abstract
We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs, requiring no manual proof steps. For problems where empty proofs fail, we evaluate 12 off-the-shelf LLMs on providing proof hints. The best model we test achieves a 55.7% pass@4 success rate with iterative error correction. These preliminary results highlight an effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F.
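To illustrate the division of labor the abstract describes, here is a minimal, hypothetical Dafny sketch (not taken from the benchmark): one lemma that Dafny's SMT-backed automation discharges with an empty proof body, and one where an intermediate assertion serves as the kind of high-level hint an LLM might supply.

```dafny
// Discharged automatically: the empty body suffices because the
// SMT solver handles the linear arithmetic on its own.
lemma LinearArith(a: int, b: int)
  ensures 2 * (a + b) == 2 * a + 2 * b
{
}

// For harder (e.g. nonlinear) goals, a high-level hint such as an
// intermediate assertion can guide the solver; supplying hints like
// this is the role played by the LLM in the proposed paradigm.
lemma SquareExpand(x: int, y: int)
  ensures (x + y) * (x + y) == x * x + 2 * x * y + y * y
{
  assert (x + y) * (x + y) == x * (x + y) + y * (x + y);
}
```

The contrast between the two bodies is the point: when automation succeeds, no proof text is needed at all; when it does not, a short sketch at the right level of abstraction can be enough for the verifier to fill in the low-level details.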