🤖 AI Summary
This study investigates how large language models (LLMs) can support higher-order cognitive tasks—specifically, deductive program verification using the Dafny language—in computer science education. Method: A controlled experiment assessed students’ performance in constructing formal correctness proofs with and without ChatGPT assistance, while systematically logging and modeling human–LLM interaction patterns. Contribution/Results: We propose a novel, “prompt-quality-driven” paradigm for integrating LLMs into formal methods instruction; advocate co-designed tasks that scaffold deep reasoning rather than supplant it; and empirically demonstrate that LLMs significantly improve verification task completion rates—though efficacy is highly sensitive to prompt engineering. Crucially, we identify a pronounced tendency among students to overtrust LLM outputs, even when erroneous. Our findings provide both theoretical grounding and actionable pedagogical guidelines for responsibly incorporating AI assistance in formal verification education.
📝 Abstract
Students in computing education increasingly use large language models (LLMs) such as ChatGPT. Yet the role of LLMs in supporting cognitively demanding tasks, such as deductive program verification, remains poorly understood. This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny, a language that supports reasoning about functional correctness by allowing programmers to write formal specifications and automatically verifying that the implementation satisfies them. We conducted a mixed-methods study with master's students enrolled in a formal methods course. Each participant completed two verification problems: one with access to a custom ChatGPT interface that logged all interactions, and one without. We identified strategies used by successful students and assessed the level of trust students place in LLMs. Our findings show that students perform significantly better when using ChatGPT; however, performance gains are tied to prompt quality. We conclude with practical recommendations for integrating LLMs into formal methods courses more effectively, including designing LLM-aware challenges that promote learning rather than substitution.
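To illustrate the kind of task the study concerns, here is a minimal Dafny sketch (an illustrative example, not one of the study's actual exercises): the `ensures` clauses are the formal specification, and Dafny's verifier automatically checks that the method body satisfies them.

```dafny
// Specification: the result is at least as large as both inputs,
// and is equal to one of them. Dafny discharges these proof
// obligations automatically at compile time.
method Max(a: int, b: int) returns (m: int)
  ensures m >= a && m >= b
  ensures m == a || m == b
{
  if a >= b { m := a; } else { m := b; }
}
```

If the implementation violated a postcondition (say, returning `a` unconditionally), verification would fail with an error rather than producing a runnable program; this feedback loop is what the verification exercises in the study revolve around.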