🤖 AI Summary
To address the challenge of verifying the correctness of code generated by large language models (LLMs), this paper proposes an end-to-end neural theorem-proving framework with three components: first, the code to be verified is translated into natural-language statements; second, a two-stage fine-tuned LLM—trained with supervised fine-tuning (SFT) followed by RL-based training rewarded by theorem-prover verification—generates formal proofs in Isabelle; third, a module employing heuristics assembles the final proof. The framework is validated on the miniF2F-test benchmark with the Isabelle proof assistant and demonstrated on a use case verifying the correctness of AWS S3 bucket access-policy code. The authors also curate a dataset based on the FVEL_ER dataset for future training tasks.
📝 Abstract
Formally verifying properties of software code has long been a desirable task, especially with the emergence of LLM-generated code. In the same vein, LLMs themselves provide an interesting avenue for exploring formal verification and mechanistic interpretability. Despite the success of code-specific models in generating code in Lean4 and Isabelle, generalized theorem proving remains far from fully solved and will serve as a benchmark for reasoning capability in LLMs. In this work, we introduce a framework that generates whole proofs in a formal language, for use within systems that leverage built-in tactics and off-the-shelf automated theorem provers. Our framework comprises three components: generating natural-language statements of the code to be verified, an LLM that generates formal proofs for a given statement, and a module employing heuristics to build the final proof. To train the LLM, we employ a two-stage fine-tuning process: SFT-based training first enables the model to generate syntactically correct Isabelle code, and RL-based training then encourages the model to generate proofs verified by a theorem prover. We validate our framework on the miniF2F-test benchmark with the Isabelle proof assistant and design a use case to verify the correctness of the AWS S3 bucket access-policy code. We also curate a dataset based on the FVEL_ER dataset for future training tasks.
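To illustrate the kind of artifact such a framework targets, here is a toy Isabelle/HOL lemma (illustrative only, not taken from the paper) in which a built-in tactic discharges the whole goal—exactly the "whole proof verified by the theorem prover" setting the abstract describes:

```isabelle
(* Toy example: reversing a list twice yields the original list.
   The built-in simplifier closes the goal in one step. *)
lemma rev_rev: "rev (rev xs) = xs"
  by simp
```

A generated proof in this framework would be a candidate script of this shape, accepted only if Isabelle successfully checks it.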