CLEVER: A Curated Benchmark for Formally Verified Code Generation

📅 2025-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for end-to-end formally verified code generation rely on test-case supervision, LLM-generated annotations, or specifications that leak implementation logic or admit vacuous solutions. Method: We construct CLEVER, a hand-curated benchmark of 161 problems in Lean. Each problem requires generating a formal specification that matches a held-out ground-truth specification, and a Lean implementation that provably satisfies it; correctness is judged solely by Lean's type checker. Contribution/Results: CLEVER provides a fully mechanized, test-case-free standard for evaluating end-to-end verified code generation. The few-shot and agentic approaches evaluated, all built on state-of-the-art language models, struggle to achieve full verification, which marks verified code generation as a frontier challenge for program synthesis and formal reasoning.

📝 Abstract

We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online (https://github.com/trishullab/clever-prover).
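
To make the two-task structure concrete, here is a minimal Lean 4 sketch of what one problem might look like. It is a toy example and is not taken from the benchmark. The names `groundTruthSpec`, `generatedSpec`, `spec_equiv`, `impl`, and `impl_correct` are hypothetical, and modeling "matches the ground-truth specification" as a logical equivalence is an assumption; the abstract does not state the exact form. The sketch is written for plain Lean 4 without Mathlib and relies on the built-in `omega` tactic.

```lean
-- Toy problem (hypothetical): return the larger of two natural numbers.

-- Held-out ground-truth specification, modeled here as a predicate
-- relating the inputs to the result (an assumed format).
def groundTruthSpec (a b result : Nat) : Prop :=
  a ≤ result ∧ b ≤ result ∧ (result = a ∨ result = b)

-- Task 1: a generated specification, written without seeing the one above.
def generatedSpec (a b result : Nat) : Prop :=
  (result = a ∧ b ≤ a) ∨ (result = b ∧ a ≤ b)

-- Task 1, continued: a machine-checked proof that the two specifications
-- agree (assumed here to mean logical equivalence).
theorem spec_equiv (a b result : Nat) :
    generatedSpec a b result ↔ groundTruthSpec a b result := by
  unfold generatedSpec groundTruthSpec
  omega

-- Task 2: an implementation ...
def impl (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- ... together with a proof that it satisfies the specification.
theorem impl_correct (a b : Nat) : groundTruthSpec a b (impl a b) := by
  unfold groundTruthSpec impl
  split <;> omega
```

In this sketch no test cases are involved: a submission counts only if Lean's type checker accepts both `spec_equiv` and `impl_correct`. A vacuous specification such as `fun _ _ _ => True` could not pass `spec_equiv`, which illustrates why comparing against a held-out specification rules out trivial solutions.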

Problem

Research questions and friction points this paper is trying to address.

Creating a benchmark for verified code generation in Lean

Generating specifications and implementations with formal correctness

Evaluating few-shot and agentic approaches for program synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality benchmark for verified code generation

Specification and implementation tasks with Lean

Post-hoc verification using Lean's type checker
