HoarePrompt: Structural Reasoning About Program Correctness in Natural Language

📅 2025-03-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the correctness verification challenge arising from the semantic gap between natural language requirements and program code. Methodologically, it introduces a structured reasoning framework that (i) integrates Hoare logic and the strongest-postcondition calculus into LLM inference; (ii) designs a few-shot-driven k-induction mechanism for formal state modeling and verification of loops; and (iii) jointly optimizes state description generation and prompt engineering to improve the precision of mapping natural language to logical assertions. On the CoCoClaNeL benchmark, the proposed approach improves MCC by 62% over Zero-shot Chain-of-Thought and by 93% over LLM-based test generation, with the k-induction component alone contributing a 28% boost to MCC. By coupling classical program verification theory with large language model reasoning, the work establishes an interpretable and scalable paradigm for trustworthy NL→Code verification.

πŸ“ Abstract
While software requirements are often expressed in natural language, verifying the correctness of a program against natural language requirements is a hard and underexplored problem. Large language models (LLMs) are promising candidates for addressing this challenge; however, our experience shows that they are ineffective in this task, often failing to detect even straightforward bugs. To address this gap, we introduce HoarePrompt, a novel approach that adapts fundamental ideas from program analysis and verification to natural language artifacts. Drawing inspiration from the strongest postcondition calculus, HoarePrompt employs a systematic, step-by-step process in which an LLM generates natural language descriptions of reachable program states at various points in the code. To manage loops, we propose few-shot-driven k-induction, an adaptation of the k-induction method widely used in model checking. Once program states are described, HoarePrompt leverages the LLM to assess whether the program, annotated with these state descriptions, conforms to the natural language requirements. To evaluate the quality of classifiers of program correctness with respect to natural language requirements, we constructed CoCoClaNeL, a challenging dataset of solutions to programming competition problems. Our experiments show that HoarePrompt improves the MCC by 62% compared to directly using Zero-shot-CoT prompts for correctness classification. Furthermore, HoarePrompt outperforms a classifier that assesses correctness via LLM-based test generation, increasing the MCC by 93%. The inductive reasoning mechanism contributes a 28% boost to MCC, underscoring its effectiveness in managing loops.
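As a toy illustration of the state-description idea (the function and annotations below are hypothetical, not taken from the paper), consider a short Python function annotated with natural-language descriptions of reachable states at each program point, in the spirit of strongest-postcondition reasoning:

```python
def max_of_list(xs):
    # Precondition: xs is a non-empty list of numbers.
    best = xs[0]
    # State: best holds the first element of xs.
    for x in xs[1:]:
        # Loop invariant: best is the maximum of the elements seen so far.
        if x > best:
            best = x
        # State: best is the maximum of all elements processed up to x.
    # Postcondition: best is the maximum element of xs.
    return best
```

In HoarePrompt, an LLM generates such state descriptions step by step and is then asked whether the annotated program conforms to the natural language requirements.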
Problem

Research questions and friction points this paper is trying to address.

Verifying program correctness against natural language requirements
Improving LLM effectiveness in detecting program bugs
Managing loops in program analysis via k-induction
Innovation

Methods, ideas, or system contributions that make the work stand out.

HoarePrompt uses strongest postcondition calculus for reasoning
Few-shot-driven k-induction manages loops effectively
LLM verifies program correctness via annotated state descriptions
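For context, the classical k-induction scheme that the paper adapts into few-shot natural-language reasoning can be sketched for an explicit-state transition system. All names below are illustrative assumptions, and a real model checker would discharge the inductive step with a solver over all states rather than a sampled set:

```python
def k_induction(init, step, prop, k, probe_states):
    """Toy k-induction check for a deterministic transition system.

    Base case: prop holds on the first k states reachable from init.
    Inductive step: whenever prop holds on k consecutive states,
    it also holds on the next state.
    """
    # Base case: unroll k steps from the initial state.
    s = init
    for _ in range(k):
        if not prop(s):
            return False
        s = step(s)
    # Inductive step, checked here only on sampled probe states.
    for s in probe_states:
        trace = [s]
        for _ in range(k):
            trace.append(step(trace[-1]))
        if all(prop(t) for t in trace[:k]) and not prop(trace[k]):
            return False
    return True

# Example: a counter that wraps at 10; the property "0 <= s < 10"
# is preserved by every transition, so k-induction succeeds.
wrap = lambda s: (s + 1) % 10
in_range = lambda s: 0 <= s < 10
print(k_induction(0, wrap, in_range, 2, range(10)))  # → True
```

HoarePrompt's contribution is to drive this unroll-then-generalize pattern with few-shot prompts over natural-language state descriptions, rather than with a symbolic solver.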