Patch Validation in Automated Vulnerability Repair

📅 2026-03-06
🤖 AI Summary
This work addresses a critical gap in current automated vulnerability repair (AVR) systems: the widespread neglect of developer-provided, semantically rich tests—referred to as PoC⁺ tests—during patch validation, which leads to an overestimation of patch correctness. To systematically investigate this issue, we construct PVBench, a benchmark comprising 209 real-world vulnerability-fix pairs that integrates functional tests, proof-of-concept (PoC) exploit tests, and PoC⁺ tests. Through comprehensive evaluation of state-of-the-art LLM-driven AVR systems, we find that over 40% of patches deemed correct by basic test suites fail under PoC⁺ testing. This exposes significant shortcomings in existing approaches regarding root cause analysis, adherence to specifications, and comprehension of developer intent. Our findings advocate for a more rigorous and holistic standard for assessing patch validity.

📝 Abstract
Automated Vulnerability Repair (AVR) systems, especially those leveraging large language models (LLMs), have demonstrated promising results in patching vulnerabilities, that is, if we trust their patch validation methodology. Ground-truth patches from human developers often come with new tests that not only confirm mitigation of the vulnerability but also encode extra semantics such as the root-cause location, the intended fix strategy, or subtle coding styles and conventions. And yet, none of the recent AVR systems verify that auto-generated patches additionally pass these new tests (termed PoC⁺ tests). This is a subtle yet critical omission. To fill this gap, we constructed a benchmark, PVBench, with 209 cases spanning 20 projects. Each case includes basic tests (the pre-patch functional tests and the PoC exploit) as well as the associated PoC⁺ tests. Evaluating three state-of-the-art AVR systems, we find that over 40% of patches validated as correct by basic tests fail under PoC⁺ testing, revealing substantial overestimation of patch success rates. Analyzing the patches falsely labeled as correct, we suggest that AVR tools should improve in three critical areas: root cause analysis, adherence to program specifications, and capturing developer intention.
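To make the validation tiers concrete, here is a minimal, hypothetical sketch of the three-tier check the abstract describes. This is not PVBench's actual harness; the function name, tier labels, and the use of plain callables as stand-ins for test suites are all illustrative assumptions.

```python
def validate_patch(functional_tests, poc_tests, poc_plus_tests):
    """Classify a candidate patch by the strongest validation tier it passes.

    Each argument is a list of zero-argument callables returning True on pass.
    Tiers follow the paper's framing: "basic" validation runs the pre-patch
    functional tests plus the PoC exploit test; full validation additionally
    requires the developer-provided PoC+ tests shipped with the real fix.
    """
    if not all(t() for t in functional_tests):
        return "broken"           # patch regresses existing functionality
    if not all(t() for t in poc_tests):
        return "vulnerable"       # the original exploit still triggers
    if not all(t() for t in poc_plus_tests):
        return "basic-only"       # passes basic tests but fails PoC+ tests
    return "poc_plus-correct"     # passes every tier

# Example: a patch that blocks the exploit but misses the root cause,
# so a developer-added boundary test (a PoC+ test) still fails.
verdict = validate_patch(
    functional_tests=[lambda: True],
    poc_tests=[lambda: True],
    poc_plus_tests=[lambda: False],
)
print(verdict)  # → basic-only
```

The "basic-only" bucket is exactly the population the paper measures: patches that existing AVR pipelines would count as correct but that fail once PoC⁺ tests are included.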
Problem

Research questions and friction points this paper is trying to address.

Automated Vulnerability Repair
Patch Validation
PoC⁺ tests
Overestimation
Ground-truth patches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Patch Validation
Automated Vulnerability Repair
PoC⁺ Tests
LLM-based Repair
PVBench
Zheng Yu
Princeton University
Machine Learning, Optimization
Wenxuan Shi
Northwestern University
Xinqian Sun
University of Waterloo
Zheyun Feng
Software Engineer at Google Research
Image Understanding, Computer Vision, Machine Learning, Information Retrieval
Meng Xu
University of Waterloo
Xinyu Xing
Northwestern University