A Multi-Dataset Evaluation of Models for Automated Vulnerability Repair

📅 2025-06-05
🤖 AI Summary
Automated Vulnerability Repair (AVR) remains hindered by poor generalization and limited cross-language/cross-dataset adaptability. This paper presents the first systematic evaluation of CodeBERT and CodeT5 for AVR across six vulnerability datasets and four programming languages (C, C++, Java, Python), using supervised fine-tuning and comprehensive multi-dimensional benchmarking. Key contributions are: (1) uncovering their complementary strengths—CodeBERT excels in sparse-context scenarios, whereas CodeT5 outperforms in complex vulnerability identification and model scalability; (2) demonstrating that fine-tuning substantially improves in-distribution repair accuracy but fails to enhance cross-distribution generalization, exposing a fundamental bottleneck in current AVR methods; and (3) establishing the first unified AVR evaluation framework supporting cross-dataset and cross-language assessment. Empirical results provide rigorous evidence on the applicability of pre-trained code models for security-critical repair tasks and identify concrete directions for future improvement.

📝 Abstract
Software vulnerabilities pose significant security threats, requiring effective mitigation. While Automated Program Repair (APR) has advanced in fixing general bugs, vulnerability patching, a security-critical aspect of APR, remains underexplored. This study investigates pre-trained language models, CodeBERT and CodeT5, for automated vulnerability patching across six datasets and four languages. We evaluate their accuracy and generalization to unknown vulnerabilities. Results show that while both models face challenges with fragmented or sparse context, CodeBERT performs comparatively better in such scenarios, whereas CodeT5 excels in capturing complex vulnerability patterns. CodeT5 also demonstrates superior scalability. Furthermore, we test fine-tuned models on both in-distribution (trained) and out-of-distribution (unseen) datasets. While fine-tuning improves in-distribution performance, models struggle to generalize to unseen data, highlighting challenges in robust vulnerability detection. This study benchmarks model performance, identifies limitations in generalization, and provides actionable insights to advance automated vulnerability patching for real-world security applications.
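The evaluation protocol the abstract describes (fine-tune on one dataset, then test on both the training distribution and unseen datasets) can be sketched as a train-on-i / test-on-j grid whose diagonal is in-distribution accuracy and whose off-diagonal cells measure cross-distribution generalization. The sketch below is an illustrative assumption, not the paper's code: `Dataset`, `evaluate_grid`, the exact-match criterion, the toy memorizing "model", and the dataset labels are all hypothetical stand-ins.

```python
# Hypothetical sketch of a cross-dataset AVR evaluation grid.
# All names (Dataset, exact_match, evaluate_grid, the memorizer, dataset
# labels) are illustrative assumptions, not artifacts of the paper.
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    pairs: list  # (vulnerable_code, fixed_code) tuples


def exact_match(predicted: str, reference: str) -> bool:
    # Simplest repair-accuracy criterion: predicted patch equals ground truth.
    return predicted.strip() == reference.strip()


def evaluate_grid(fine_tuned_models, datasets):
    """Train-on-i / test-on-j matrix: diagonal cells are in-distribution
    accuracy, off-diagonal cells probe cross-distribution generalization."""
    grid = {}
    for train_name, model in fine_tuned_models.items():
        for ds in datasets:
            hits = sum(exact_match(model(src), ref) for src, ref in ds.pairs)
            grid[(train_name, ds.name)] = hits / len(ds.pairs)
    return grid


def make_memorizer(train_ds):
    # Toy stand-in for a fine-tuned model: repairs only inputs it has seen.
    table = dict(train_ds.pairs)
    return lambda src: table.get(src, src)  # unseen input -> left unrepaired


d1 = Dataset("dataset-A", [("x = gets(buf)", "x = fgets(buf, n, stdin)")])
d2 = Dataset("dataset-B", [("strcpy(a, b)", "strncpy(a, b, n)")])
models = {d.name: make_memorizer(d) for d in (d1, d2)}
grid = evaluate_grid(models, [d1, d2])
# Diagonal entries are 1.0; off-diagonal entries are 0.0, mirroring the
# in-distribution vs. cross-distribution gap the paper reports.
```

The memorizer deliberately exaggerates the paper's finding: perfect in-distribution repair with zero transfer, which is the failure mode the off-diagonal cells are designed to expose.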
Problem

Research questions and friction points this paper is trying to address.

Evaluating models for automated vulnerability repair across datasets
Assessing generalization of models to unknown vulnerabilities
Identifying limitations in robust vulnerability detection and patching
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates CodeBERT and CodeT5 for vulnerability patching
Tests models on in-distribution and out-of-distribution datasets
Identifies challenges in generalization for unseen vulnerabilities
Zanis Ali Khan
Luxembourg Institute of Science and Technology (LIST), Luxembourg
Log Parsing, Anomaly Detection, LLMs, Vulnerability Detection and Patching
Aayush Garg
Luxembourg Institute of Science and Technology (LIST), Luxembourg
Qiang Tang
Luxembourg Institute of Science and Technology (LIST), Luxembourg