🤖 AI Summary
To address the data scarcity challenge confronting deep learning-based vulnerability detectors (DLVDs), this paper proposes VulScribeR, a Retrieval-Augmented Generation (RAG)-enhanced large language model framework built on LLaMA-3 and CodeLlama. VulScribeR is the first to systematically explore controllable generation of multi-type, multi-statement vulnerable code. It introduces three vulnerability-preserving augmentation strategies, Mutation, Injection, and Extension, to achieve semantically consistent, fine-grained (single- or multi-statement) code augmentation. Customized prompt templates and vulnerability-oriented context expansion further improve generation fidelity. Evaluated on three benchmark datasets, VulScribeR achieves average F1-score improvements of 27.48%–69.90% over the baselines at a generation cost of only US$1.88 per thousand samples, significantly outperforming Vulgen, VGX, and random oversampling.
📝 Abstract
Detecting vulnerabilities is vital for software security, yet deep learning-based vulnerability detectors (DLVDs) face a data shortage that limits their effectiveness. Data augmentation can potentially alleviate this shortage, but augmenting vulnerable code is challenging and requires a generative solution that preserves the vulnerability. Previous works have focused only on generating samples that contain single statements or specific types of vulnerabilities. Recently, large language models (LLMs) have been used to solve various code generation and comprehension tasks with inspiring results, especially when fused with retrieval-augmented generation (RAG). We therefore propose VulScribeR, a novel LLM-based solution that leverages carefully curated prompt templates to augment vulnerable datasets. More specifically, we explore three strategies for augmenting both single- and multi-statement vulnerabilities with LLMs, namely Mutation, Injection, and Extension. Our extensive evaluation across three vulnerability datasets and DLVD models, using two LLMs, shows that our approach beats the two SOTA methods Vulgen and VGX, and Random Oversampling (ROS), by 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable samples on average, and by 53.84%, 54.10%, 69.90%, and 40.93% with 15K generated vulnerable samples. Our approach demonstrates its feasibility for large-scale data augmentation by generating 1K samples for as little as US$1.88.
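To make the RAG-plus-prompt-template idea concrete, the sketch below shows one way an Injection-style prompt could be assembled: retrieve vulnerable exemplars similar to a clean code sample, then embed both in a templated instruction for an LLM. This is an illustrative toy, not the paper's actual pipeline: the corpus, the lexical `difflib` retriever, and the prompt wording are all invented here (the paper's retriever and templates are more elaborate).

```python
from difflib import SequenceMatcher

# Hypothetical toy corpus of labeled vulnerable snippets (stand-in for a real
# retrieval index over a vulnerability dataset).
VULN_CORPUS = [
    ("CWE-787", "strcpy(buf, user_input);  /* no bounds check */"),
    ("CWE-476", "ptr->field = 0;  /* ptr may be NULL */"),
    ("CWE-190", "int total = count * size;  /* may overflow */"),
]

def retrieve_similar(clean_code, corpus, k=1):
    """Rank corpus snippets by lexical similarity to the clean code.

    A real RAG setup would use embedding or hybrid retrieval; plain
    SequenceMatcher similarity keeps this sketch dependency-free.
    """
    scored = sorted(
        corpus,
        key=lambda item: SequenceMatcher(None, clean_code, item[1]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_injection_prompt(clean_code, corpus, k=1):
    """Fill an Injection-style template: show retrieved vulnerable exemplars
    and ask the model to inject a similar flaw into the clean sample."""
    exemplars = retrieve_similar(clean_code, corpus, k)
    examples = "\n".join(f"[{cwe}] {code}" for cwe, code in exemplars)
    return (
        "You are given a clean code sample and vulnerable exemplar code.\n"
        "Inject a logically similar vulnerability into the clean sample,\n"
        "changing as few statements as possible and keeping it compilable.\n\n"
        f"Vulnerable exemplar(s):\n{examples}\n\n"
        f"Clean sample:\n{clean_code}\n"
    )

prompt = build_injection_prompt("memcpy(dest, src, len);", VULN_CORPUS)
print(prompt)
```

The resulting string would be sent to the LLM; Mutation and Extension would reuse the same retrieve-then-template structure with different instructions (rewrite an existing vulnerable sample, or extend it with retrieved vulnerable logic).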