Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the problem that redundant code context degrades model comprehension in ML-based vulnerability detection, this paper proposes Trace Gadgets: a minimal code representation built by minimizing execution paths, retaining only the statements required to reach the vulnerability while keeping the path complete and the context concise. Key contributions include: (1) the first execution-path-minimization representation paradigm; (2) the first large-scale vulnerability dataset with manually curated labels generated from real-world applications; and (3) a pipeline combining program slicing, dynamic taint analysis, and lightweight pre-trained models (CodeBERT/GraphCodeBERT) with human-verified labeling. Experiments show an F1-score improvement of at least 4% over industry-standard static scanners such as GitHub's CodeQL on fully unseen projects, and applying the framework to real-world software uncovers multiple previously unknown vulnerabilities, for which CVE identifiers were assigned.

📝 Abstract
As the number of web applications and API endpoints exposed to the Internet continues to grow, so does the number of exploitable vulnerabilities. Manually identifying such vulnerabilities is tedious. Meanwhile, static security scanners tend to produce many false positives. While machine learning-based approaches are promising, they typically perform well only in scenarios where training and test data are closely related. A key challenge for ML-based vulnerability detection is providing suitable and concise code context, as excessively long contexts negatively affect the code comprehension capabilities of machine learning models, particularly smaller ones. This work introduces Trace Gadgets, a novel code representation that minimizes code context by removing non-related code. Trace Gadgets precisely capture the statements that cover the path to the vulnerability. As input for ML models, Trace Gadgets provide a minimal but complete context, thereby improving the detection performance. Moreover, we collect a large-scale dataset generated from real-world applications with manually curated labels to further improve the performance of ML-based vulnerability detectors. Our results show that state-of-the-art machine learning models perform best when using Trace Gadgets compared to previous code representations, surpassing the detection capabilities of industry-standard static scanners such as GitHub's CodeQL by at least 4% on a fully unseen dataset. By applying our framework to real-world applications, we identify and report previously unknown vulnerabilities in widely deployed software.
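The abstract's core idea, keeping only the statements that cover the path to the vulnerability, can be sketched as a backward data-dependence walk over a recorded execution trace. The sketch below is illustrative only, not the authors' implementation: the `(code, defs, uses)` trace encoding, the `trace_gadget` function, and the SQL-injection example are all hypothetical.

```python
# Illustrative sketch of trace minimization: starting from a sink, walk the
# recorded execution trace backwards and keep only statements whose defined
# variables flow (transitively) into the sink. Everything else is "non-related
# code" in the abstract's sense and is dropped.

def trace_gadget(trace, sink_index):
    """Return the minimized statement list for the given sink.

    trace: list of (code, defs, uses) tuples in execution order, where
           defs/uses are sets of variable names.
    sink_index: position of the security-relevant sink within the trace.
    """
    needed = set(trace[sink_index][2])       # variables the sink consumes
    keep = {sink_index}
    for i in range(sink_index - 1, -1, -1):  # walk the trace backwards
        code, defs, uses = trace[i]
        if defs & needed:                    # statement defines a needed variable
            keep.add(i)
            needed = (needed - defs) | uses  # its own inputs become needed
    return [trace[i][0] for i in sorted(keep)]

# Hypothetical trace of a SQL-injection path; logging lines are unrelated context.
trace = [
    ("uid = request.args['id']",            {"uid"}, set()),
    ("log = open('app.log')",               {"log"}, set()),
    ("q = 'SELECT * WHERE id=' + uid",      {"q"},   {"uid"}),
    ("log.write('query built')",            set(),   {"log"}),
    ("db.execute(q)",                       set(),   {"q"}),   # the sink
]
gadget = trace_gadget(trace, 4)
# Only the taint path survives: the two log statements are removed.
```

A real implementation would derive `defs`/`uses` from program analysis rather than hand-written tuples, but the minimization principle is the same: the gadget is complete with respect to the path to the sink, yet contains nothing else.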
Problem

Research questions and friction points this paper is trying to address.

Minimizing code context for ML vulnerability detection
Reducing false positives in static security scanners
Improving ML model performance with concise code representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimizes code context via Trace Gadgets
Captures the statements covering the path to the vulnerability
Improves ML detection with curated dataset
Felix Mächtle
Institute for IT Security, University of Lübeck, Germany
Nils Loose
University of Lübeck
Tim Schulz
University of Hamburg, Institute for Humanities-Centered Artificial Intelligence, Germany
Florian Sieck
Institute for IT Security, University of Lübeck, Germany
Jan-Niclas Serr
Institute for IT Security, University of Lübeck, Germany
Ralf Möller
University of Hamburg, Institute for Humanities-Centered Artificial Intelligence, Germany
Thomas Eisenbarth
University of Lübeck
Computer Security · Applied Cryptography · Privacy