🤖 AI Summary
The scarcity of high-quality, formally verified code severely constrains the training and application of large language models (LLMs) in program verification.
Method: We propose ATLAS, the first automated data synthesis pipeline tailored for formal verification, built on the Dafny language. Our approach introduces a multi-stage task decomposition paradigm that jointly generates specifications, implementations, and machine-checkable proofs, yielding more than seven fine-grained training samples per program.
Contribution/Results: We construct the largest verified-code dataset to date—comprising 2,700 fully verified programs and over 19,000 samples—and perform verification-aware fine-tuning and synthetic-data distillation on Qwen2.5-7B-Coder. Experiments demonstrate absolute accuracy improvements of 23 percentage points on DafnyBench and 50 percentage points on DafnySynthesis, substantially alleviating the data bottleneck for LLMs in program verification.
📝 Abstract
Large language models have shown potential for program verification, but progress is hindered by the scarcity of verified code for training. We present ATLAS, an automated pipeline that synthesizes verified programs at scale to address this data bottleneck. ATLAS generates complete Dafny programs with specifications, implementations, and proofs, producing 2.7K verified programs from which we extract over 19K training examples (more than 7 per verified program) by decomposing the synthesis process into multiple specialized tasks. Fine-tuning Qwen2.5-7B-Coder on this dataset produces substantial gains: +23 percentage points on DafnyBench and +50 percentage points on DafnySynthesis. These results demonstrate that synthetic verified code can effectively enhance LLM capabilities for formal verification.
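For readers unfamiliar with Dafny, the following minimal sketch (illustrative only, not taken from the paper or its dataset) shows the three components the pipeline jointly generates: a specification (the `ensures` contract), an implementation (the method body), and a machine-checkable proof obligation that Dafny's verifier discharges automatically.

```dafny
// Specification: the contract states what Max must guarantee.
method Max(a: int, b: int) returns (m: int)
  ensures m >= a && m >= b
  ensures m == a || m == b
{
  // Implementation: a straightforward branch.
  if a >= b {
    m := a;
  } else {
    m := b;
  }
}
// Proof: for this simple method, Dafny verifies the postconditions
// automatically; programs with loops would additionally require
// explicit invariant annotations to be accepted by the verifier.
```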