AI Summary
To address the scarcity of real-world vulnerability data, severe class imbalance, and limited practicality of existing augmentation techniques in software vulnerability detection, this paper proposes, for the first time, Mixup-style data augmentation in the code representation space rather than at the source-code level. To preserve vulnerability-critical semantics, we design a conditional augmentation mechanism that constrains embedding transformations to keep sensitive regions invariant. Leveraging code embeddings, we develop five representation-level augmentation methods (e.g., CodeMixup, EmbMixup) and systematically evaluate them under both conditional constraints and random oversampling. Experimental results demonstrate that our approach achieves up to a 9.67% improvement in F1-score, strongly validating the efficacy of representation-level augmentation. Although marginally below random oversampling (+10.82%), our method establishes a novel paradigm and benchmark for hybrid augmentation strategies in vulnerability detection.
Abstract
Various Deep Learning (DL) methods have recently been applied to detect software vulnerabilities. Real-world software vulnerability datasets are rare and hard to acquire, since there is no simple metric for classifying code as vulnerable. The available datasets are heavily imbalanced, and none of them is large by DL standards. To tackle these problems, a recent work attempted to augment datasets at the source-code level by generating realistic single-statement vulnerabilities, which is not very practical and requires manual checking of the generated samples. In this work, we instead explore augmenting vulnerabilities at the representation level to help current models learn better, which, to the best of our knowledge, has not been done before. We implement and evaluate five augmentation techniques that operate on the embeddings of the data and were recently used for code search, a quite different software engineering task. We also introduce conditioned versions of these augmentation methods, which ensure the augmentation does not change the vulnerable section of the vector representation. We show that such augmentation methods can be helpful, increasing the F1-score by up to 9.67%, yet they cannot beat Random Oversampling for balancing datasets, which increases the F1-score by 10.82%.
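As a minimal sketch of the core idea (not the paper's exact implementation), embedding-level Mixup linearly interpolates two representation vectors, while the conditioned variant keeps a set of vulnerability-critical dimensions untouched; the boolean `sensitive_mask` below is a hypothetical stand-in for however those sensitive regions are identified:

```python
import numpy as np

def emb_mixup(x1, x2, lam):
    """Plain Mixup in embedding space: convex combination of two vectors."""
    return lam * x1 + (1 - lam) * x2

def conditioned_emb_mixup(x1, x2, lam, sensitive_mask):
    """Conditioned variant: dimensions flagged as vulnerability-critical
    retain x1's original values; only the remaining dimensions are mixed."""
    mixed = emb_mixup(x1, x2, lam)
    return np.where(sensitive_mask, x1, mixed)

# Toy example with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x1 = rng.normal(size=8)            # embedding of a vulnerable sample
x2 = rng.normal(size=8)            # embedding of another sample
mask = np.zeros(8, dtype=bool)
mask[:3] = True                    # hypothetical "sensitive" region
aug = conditioned_emb_mixup(x1, x2, 0.7, mask)
```

Here the first three dimensions of `aug` equal those of `x1` exactly, while the rest are the 0.7/0.3 mixture, illustrating how the conditioning preserves the vulnerable portion of the representation.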