VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based vulnerability repair evaluations predominantly rely on superficial validation and neglect actual exploit mitigation, leading to inflated security assessments. This work introduces the first rigorous evaluation framework grounded in executable proof-of-concept (PoC) exploits: a containerized, isolated, automated verification pipeline built around differential testing, benchmarked on 23 real-world Python vulnerabilities, where a repair counts as successful only if the patched code blocks the original exploit. Evaluation across 12 mainstream LLMs reveals a maximum repair success rate of only 21.7%, exposing fundamental deficiencies in vulnerability understanding and patch correctness. Enhanced prompting and multi-agent strategies yield only marginal improvements. The framework establishes a security-critical evaluation paradigm for LLM vulnerability repair, shifting the focus from syntactic correctness to functional exploit resistance.

📝 Abstract
The adoption of Large Language Models (LLMs) for automated software vulnerability patching has shown promising outcomes on carefully curated evaluation sets. Nevertheless, existing datasets predominantly rely on superficial validation methods rather than exploit-based verification, leading to overestimated performance in security-sensitive applications. This paper introduces VulnRepairEval, an evaluation framework anchored in functional Proof-of-Concept (PoC) exploits. Our framework delivers a comprehensive, containerized evaluation pipeline that enables reproducible differential assessment, where repair success requires the original exploit to fail execution against the modified code. The benchmark construction involved extensive data curation: we processed over 400 CVEs and approximately 2,500 potential sources to extract a collection of authentic vulnerability instances (23 Python CVEs) amenable to automated testing with working PoCs. Through VulnRepairEval, we conduct a comprehensive evaluation of 12 popular LLMs and observe a significant performance deficit: even the top-performing model successfully addresses merely 5/23 instances (about 21.7%), exposing critical weaknesses in security-focused applications. Our failure analysis reveals that most unsuccessful attempts stem from imprecise vulnerability identification and patches containing syntactic or semantic errors. Enhanced prompting strategies and multi-agent approaches yield minimal improvements, with overall effectiveness remaining largely unaffected. This work contributes a stringent, practical evaluation framework for LLM-driven vulnerability remediation and underscores the necessity for assessment protocols that authentically reflect real-world exploitation scenarios.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM vulnerability repair using exploit-based verification
Addresses overestimated performance from superficial validation methods
Assesses real-world security application effectiveness via PoC exploits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploit-based evaluation framework for vulnerability repair
Containerized pipeline for reproducible differential assessment
Functional Proof-of-Concept exploits for authentic testing
Authors
Weizhe Wang (Tianjin University, China)
Wei Ma (Singapore Management University, Singapore)
Qiang Hu (Tianjin University, China)
Yao Zhang (Tianjin University, China)
Jianfei Sun (Nanyang Technological University; Applied Cryptography)
Bin Wu (Tianjin University, China)
Yang Liu (Nanyang Technological University, Singapore)
Guangquan Xu (Tianjin University; Cyber Security, IoT Security, Trust Management, Trusted Computing)
Lingxiao Jiang (Professor of Computer Science, Singapore Management University; Software Engineering, Data Mining, Cyber Security, Programming Languages, Systems)