Can LLMs Enable Verification in Mainstream Programming?

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Formal verification remains underutilized in practical programming, while large language models (LLMs) generate code lacking strong correctness guarantees. Method: This work conducts the first systematic evaluation of LLMs’ end-to-end capability to generate verifiable code in three mainstream verification languages—Dafny, Nagini, and Verus—using a human-curated, HumanEval-derived benchmark, augmented with prompt engineering and controllable generation techniques. Contribution/Results: We find that structured prompts—particularly those incorporating formal specifications alongside pre- and postconditions—significantly improve syntactic correctness and verification success rates. Performance varies across verification languages, reflecting inherent differences in expressiveness and tooling. Crucially, LLMs demonstrate foundational verification-aware code generation capabilities, including comprehension of specification intent and basic invariant reasoning. This study provides empirical evidence and methodological guidance for advancing trustworthy AI-assisted programming, bridging the gap between scalable code synthesis and formal assurance.
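The summary mentions pre-/postconditions and basic invariant reasoning. As a rough illustration only (not the paper's actual setup), the same contract style can be mimicked in plain Python, with runtime asserts standing in for what a verifier such as Dafny or Nagini would check statically; the function name and invariant are illustrative:

```python
def sum_to(n: int) -> int:
    """Sum the integers 1..n, with the specification checked at runtime."""
    # Precondition ("requires" in Dafny): the input must be non-negative.
    assert n >= 0, "precondition: n >= 0"
    total, i = 0, 0
    while i < n:
        # Loop invariant: at the loop head, total == 1 + 2 + ... + i.
        assert total == i * (i + 1) // 2, "loop invariant"
        total += i + 1
        i += 1
    # Postcondition ("ensures" in Dafny): the closed-form sum holds.
    assert total == n * (n + 1) // 2, "postcondition"
    return total

print(sum_to(5))  # → 15
```

A static verifier would discharge these three conditions once at compile time rather than re-checking them on every call, which is the correctness guarantee the paper is after.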

📝 Abstract
Although formal methods are capable of producing reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models is becoming increasingly widespread, but it rarely considers producing strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-of-the-art Python benchmark, HumanEval. We also assess what types of information are sufficient to achieve good-quality results.
Problem

Research questions and friction points this paper is trying to address.

Explore LLMs' ability to generate verified code.
Assess what information suffices for high-quality verification.
Evaluate on datasets derived from the Python benchmark HumanEval.
Innovation

Methods, ideas, or system contributions that make the work stand out.

First end-to-end evaluation of LLM-generated verified code in Dafny, Nagini, and Verus.
Manually curated datasets derived from the HumanEval benchmark.
Assessment of what information suffices for high-quality verified code.
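The innovation hinges on structured prompts that pair a natural-language task with a formal specification. A minimal sketch of what such prompt assembly might look like, assuming a hypothetical helper (the function name, parameters, and wording are illustrative, not the paper's implementation):

```python
def build_structured_prompt(task: str, precondition: str, postcondition: str,
                            target_language: str = "Dafny") -> str:
    """Assemble a prompt pairing a task description with a formal spec."""
    return (
        f"Write a verified {target_language} method for the task below.\n"
        f"Task: {task}\n"
        f"Precondition (requires): {precondition}\n"
        f"Postcondition (ensures): {postcondition}\n"
        f"Include any loop invariants the verifier needs to succeed.\n"
    )

prompt = build_structured_prompt(
    task="return the sum of the integers from 1 to n",
    precondition="n >= 0",
    postcondition="result == n * (n + 1) / 2",
)
print(prompt)
```

The paper's finding is that including the pre- and postconditions in the prompt, rather than the task description alone, significantly improves verification success rates.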
Aleksandr Shefer
JetBrains Research, Amsterdam, the Netherlands; Constructor University, Bremen, Germany
Igor Engel
JetBrains Research, Amsterdam, the Netherlands; Constructor University, Bremen, Germany
Stanislav Alekseev
JetBrains Research, Amsterdam, the Netherlands; Neapolis University, Pafos, Cyprus
Daniil Berezun
JetBrains Research
computer science
Ekaterina Verbitskaia
JetBrains Research
metacomputation, relational programming, supercompilation, parsing, context-free path querying
Anton Podkopaev
JetBrains Research, Constructor University Bremen
programming languages, functional programming, verification, concurrency