A Study On Mixup-inspired Augmentation Methods For Software Vulnerability Detection

📅 2025-04-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the scarcity of real-world vulnerability data, severe class imbalance, and the limited practicality of source-level augmentation in software vulnerability detection, this paper explores Mixup-inspired data augmentation in the code representation space rather than at the source-code level, which the authors believe has not been done before for this task. It implements and evaluates five representation-level augmentation methods that operate on code embeddings and were previously used for code search. To preserve vulnerability-critical semantics, it also introduces a conditioned variant of each method that leaves the vulnerable section of the vector representation unchanged. The augmentation methods improve F1-score by up to 9.67%, which shows that representation-level augmentation can help, but they do not beat Random Oversampling, which improves F1-score by 10.82% when used to balance the dataset.

📝 Abstract
Various Deep Learning (DL) methods have recently been used to detect software vulnerabilities. Real-world software vulnerability datasets are rare and hard to acquire, as there is no simple metric for classifying vulnerabilities. Such datasets are heavily imbalanced, and none of the current datasets is considered large for DL models. To tackle these problems, a recent work augmented the dataset at the source-code level by generating realistic single-statement vulnerabilities, which is not quite practical and requires manual checking of the generated vulnerabilities. We instead explore augmenting vulnerabilities at the representation level to help current models learn better, which, to the best of our knowledge, has not been done before. We implement and evaluate five augmentation techniques that augment the embedding of the data and have recently been used for code search, a completely different software engineering task. We also introduce a conditioned version of those augmentation methods, which ensures that the augmentation does not change the vulnerable section of the vector representation. We show that such augmentation methods can be helpful and increase the F1-score by up to 9.67%, yet they cannot beat Random Oversampling, which increases the F1-score by 10.82% when balancing datasets.
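
The Random Oversampling baseline that the abstract compares against can be illustrated with a minimal sketch. This is not the authors' code: the function name `random_oversample`, the toy two-dimensional embeddings, and the choice to duplicate minority samples until all classes match the largest class are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(embeddings, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(embeddings), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [x for x, y in zip(embeddings, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(list(rng.choice(pool)))  # exact copy, no mixing
            out_y.append(cls)
    return out_x, out_y


# Toy data (assumed): four non-vulnerable (0) and one vulnerable (1) embedding.
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2], [0.0, 0.4], [0.9, 0.8]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

Unlike the augmentation methods studied in the paper, this baseline creates no new vectors; it only repeats existing minority-class embeddings.
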
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of real-world software vulnerability datasets
Exploring representation-level augmentation for vulnerability detection
Improving DL model performance with embedding augmentation techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixup-inspired augmentation for vulnerability detection
Augmentation at representation level for better learning
Conditioned augmentation preserves vulnerable sections
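
As an illustration of the conditioned idea, the sketch below applies a Mixup-style linear interpolation to two embeddings while copying a set of protected positions unchanged. It is a sketch under stated assumptions, not the paper's implementation: the function names, the representation of the "vulnerable section" as a list of vector indices, and the range from which the mixing weight `lam` is drawn are all hypothetical.

```python
import random


def conditioned_interpolation(x_i, x_j, vulnerable_idx, lam):
    """Mix x_i with x_j as lam * x_i + (1 - lam) * x_j, but copy x_i
    unchanged at the positions listed in vulnerable_idx (assumed format)."""
    if len(x_i) != len(x_j):
        raise ValueError("embeddings must have the same length")
    keep = set(vulnerable_idx)
    return [
        a if k in keep else lam * a + (1.0 - lam) * b
        for k, (a, b) in enumerate(zip(x_i, x_j))
    ]


def augment(samples, vulnerable_idx, n_new, lam_range=(0.8, 1.0), seed=0):
    """Create n_new vectors by mixing randomly chosen pairs of samples.
    The lam_range default is an illustrative assumption."""
    rng = random.Random(seed)
    new = []
    for _ in range(n_new):
        x_i, x_j = rng.sample(samples, 2)  # two distinct samples
        lam = rng.uniform(*lam_range)
        new.append(conditioned_interpolation(x_i, x_j, vulnerable_idx, lam))
    return new


# Toy vulnerable-class embeddings (assumed values).
vulnerable = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
# Positions 2 and 3 are treated as the protected "vulnerable section".
mixed = conditioned_interpolation(vulnerable[0], vulnerable[1], [2, 3], lam=0.75)
```

Here positions 0 and 1 of `mixed` are interpolated between the two inputs, while positions 2 and 3 keep the values of the first vector. Passing an empty index list gives the unconditioned interpolation.
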
Seyed Shayan Daneshvar
CS Master's Graduate, University of Manitoba
AI4SE, AIOps, Computer Vision, Deep Learning, Generative AI
Da Tan
University of Manitoba, Winnipeg, Manitoba, Canada
Shaowei Wang
University of Manitoba, Winnipeg, Manitoba, Canada
Carson Leung
University of Manitoba, Winnipeg, Manitoba, Canada