🤖 AI Summary
Formal verification remains underutilized in practical programming, while large language models (LLMs) generate code without strong correctness guarantees. Method: this work presents the first systematic evaluation of LLMs' end-to-end ability to generate verifiable code in three mainstream verification languages (Dafny, Nagini, and Verus), using a human-curated benchmark derived from HumanEval and augmented with prompt engineering and controllable generation techniques. Contribution/Results: structured prompts, particularly those that pair formal specifications with pre- and postconditions, significantly improve both syntactic correctness and verification success rates. Performance varies across the three verification languages, reflecting inherent differences in their expressiveness and tooling. Crucially, LLMs demonstrate foundational verification-aware code generation capabilities, including comprehension of specification intent and basic invariant reasoning. The study provides empirical evidence and methodological guidance for advancing trustworthy AI-assisted programming, bridging the gap between scalable code synthesis and formal assurance.
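To make the target concrete, here is a minimal Dafny sketch (an illustrative example, not taken from the paper's benchmark) of the kind of verification-aware code the study evaluates: the `requires`/`ensures` clauses form the formal specification, and the loop invariants give the verifier enough information to prove the postconditions automatically.

```dafny
// Illustrative sketch: find the maximum element of a non-empty array.
// The requires/ensures clauses are the formal specification; the loop
// invariants are what the verifier needs to discharge the proof.
method FindMax(a: array<int>) returns (max: int)
  requires a.Length > 0
  ensures exists i :: 0 <= i < a.Length && a[i] == max
  ensures forall i :: 0 <= i < a.Length ==> a[i] <= max
{
  max := a[0];
  var idx := 1;
  while idx < a.Length
    // The invariants track that max is an element of, and an upper
    // bound on, the prefix a[0..idx] inspected so far.
    invariant 1 <= idx <= a.Length
    invariant exists i :: 0 <= i < idx && a[i] == max
    invariant forall i :: 0 <= i < idx ==> a[i] <= max
  {
    if a[idx] > max {
      max := a[idx];
    }
    idx := idx + 1;
  }
}
```

In the end-to-end setting studied here, the model is expected to produce both the implementation and annotations of this kind, so that the verifier accepts the code without further human intervention.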
📝 Abstract
Although formal methods can produce reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models (LLMs) is becoming increasingly widespread, but it rarely aims to provide strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-of-the-art Python benchmark HumanEval. We also assess what types of information are sufficient to achieve good-quality results.