🤖 AI Summary
This work addresses the automated synthesis of loop invariants for the formal verification of programs with loops. We propose a closed-loop generate-and-verify framework that tightly couples reasoning-optimized large language models (LLMs), specifically OpenAI O1, O1-mini, and O3-mini, with the Z3 SMT solver. Invariant synthesis proceeds by counterexample-guided iterative refinement: when the solver rejects a candidate invariant, its counterexample is fed back to the LLM to guide the next proposal. To our knowledge, this is the first approach to achieve deep, symbol-level synergy between LLMs and SMT solvers in formal verification. On the Code2Inv benchmark (133 tasks), our method solves all 133 (100% coverage), surpassing the prior state of the art (107/133), with an average of only 1–2 LLM invocations and a runtime of 14–55 seconds per task. These results indicate gains in both coverage and efficiency over prior work and support the potential of LLMs for formal deductive reasoning.
📝 Abstract
Loop invariants are essential for proving the correctness of programs with loops. Developing loop invariants is challenging, and fully automatic synthesis cannot be guaranteed for arbitrary programs. Several approaches have been proposed to synthesize loop invariants using symbolic techniques and, more recently, neural methods. These approaches correctly synthesize loop invariants only for subsets of standard benchmarks. In this work, we investigate whether modern, reasoning-optimized large language models can do better. We integrate OpenAI's O1, O1-mini, and O3-mini into a tightly coupled generate-and-check pipeline with the Z3 SMT solver, using solver counterexamples to iteratively guide invariant refinement. We use the Code2Inv benchmark, which provides C programs along with their formal preconditions and postconditions. On this benchmark of 133 tasks, our framework achieves 100% coverage (133 out of 133), outperforming the previous best of 107 out of 133, while requiring only 1-2 model proposals per instance and 14-55 seconds of wall-clock time. These results demonstrate that LLMs possess latent logical reasoning capabilities that can help automate loop invariant synthesis. While our experiments target C programs, the approach should generalize to other imperative languages.
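The generate-and-check loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the toy program, the helper names (`check`, `propose`, `synthesize`), and the scripted candidate list that stands in for the LLM. A bounded brute-force search stands in for Z3, so a pass here is only evidence within the bound, whereas an SMT solver would discharge the same three conditions (initiation, consecution, postcondition) over all integers.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
Invariant = Callable[[State], bool]
Counterexample = Tuple[str, State]

# Hypothetical toy task in the spirit of Code2Inv:
#   assume(x == 0 && y == 0); while (x < 10) { x++; y++; } assert(y == 10);
def pre(s: State) -> bool:
    return s["x"] == 0 and s["y"] == 0

def guard(s: State) -> bool:
    return s["x"] < 10

def body(s: State) -> State:
    return {"x": s["x"] + 1, "y": s["y"] + 1}

def post(s: State) -> bool:
    return s["y"] == 10

def check(inv: Invariant, bound: int = 12) -> Optional[Counterexample]:
    """Bounded stand-in for the SMT solver (assumption: not a proof).

    Returns the first violated condition with a witness state, or None
    if all three conditions hold for every state within the bound.
    """
    for x, y in product(range(-bound, bound + 1), repeat=2):
        s = {"x": x, "y": y}
        if pre(s) and not inv(s):
            return ("initiation", s)       # pre does not imply inv
        if inv(s) and guard(s) and not inv(body(s)):
            return ("consecution", s)      # inv not preserved by the loop body
        if inv(s) and not guard(s) and not post(s):
            return ("postcondition", s)    # inv at loop exit does not imply post
    return None

# Scripted candidates standing in for LLM proposals (assumption).
CANDIDATES: List[Tuple[str, Invariant]] = [
    ("x <= 10", lambda s: s["x"] <= 10),
    ("x == y", lambda s: s["x"] == s["y"]),
    ("x == y and x <= 10", lambda s: s["x"] == s["y"] and s["x"] <= 10),
]

def propose(history: List[Tuple[str, Counterexample]]) -> Optional[Tuple[str, Invariant]]:
    """Stand-in for the LLM call; a real system would put `history` in the prompt."""
    if len(history) >= len(CANDIDATES):
        return None
    return CANDIDATES[len(history)]

def synthesize(max_rounds: int = 5) -> Tuple[Optional[str], List[Tuple[str, Counterexample]]]:
    """Propose, check, and feed counterexamples back until a candidate passes."""
    history: List[Tuple[str, Counterexample]] = []
    for _ in range(max_rounds):
        candidate = propose(history)
        if candidate is None:
            break
        text, inv = candidate
        cex = check(inv)
        if cex is None:
            return text, history
        history.append((text, cex))        # counterexample guides the next proposal
    return None, history

if __name__ == "__main__":
    invariant, attempts = synthesize()
    print("accepted:", invariant)
    for text, (kind, state) in attempts:
        print(f"rejected {text!r}: {kind} violated at {state}")
```

The design point the sketch tries to capture is the division of labor: the proposer only has to guess, while the checker decides soundness and returns a concrete failing state, so each rejected candidate narrows the next guess.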