🤖 AI Summary
To address the problem that redundant code context degrades model comprehension in ML-based vulnerability detection, this paper proposes Trace Gadgets: a minimal code representation built on execution-path minimization that retains only the statements needed to reach the vulnerability, preserving path completeness while keeping the context concise. Key contributions include: (1) the first execution-path-minimization representation paradigm; (2) the first large-scale vulnerability dataset generated from real-world applications with manually curated labels; and (3) a pipeline combining program slicing, dynamic taint analysis, and lightweight pre-trained models (CodeBERT/GraphCodeBERT), backed by human-verified labeling. Experiments show that the approach improves F1-score by at least 4% over industry-standard static scanners (e.g., GitHub's CodeQL) on fully unseen projects and uncovers multiple previously unknown vulnerabilities in real-world software, several of which were assigned CVE identifiers.
📝 Abstract
As the number of web applications and API endpoints exposed to the Internet continues to grow, so does the number of exploitable vulnerabilities. Manually identifying such vulnerabilities is tedious. Meanwhile, static security scanners tend to produce many false positives. While machine learning-based approaches are promising, they typically perform well only in scenarios where training and test data are closely related. A key challenge for ML-based vulnerability detection is providing suitable and concise code context, as excessively long contexts negatively affect the code comprehension capabilities of machine learning models, particularly smaller ones. This work introduces Trace Gadgets, a novel code representation that minimizes code context by removing non-related code. Trace Gadgets precisely capture the statements that cover the path to the vulnerability. As input for ML models, Trace Gadgets provide a minimal but complete context, thereby improving the detection performance. Moreover, we collect a large-scale dataset generated from real-world applications with manually curated labels to further improve the performance of ML-based vulnerability detectors. Our results show that state-of-the-art machine learning models perform best when using Trace Gadgets compared to previous code representations, surpassing the detection capabilities of industry-standard static scanners such as GitHub's CodeQL by at least 4% on a fully unseen dataset. By applying our framework to real-world applications, we identify and report previously unknown vulnerabilities in widely deployed software.
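The core idea behind the representation — keeping only the statements on an execution path that the vulnerable sink transitively depends on — can be sketched as a backward slice over a recorded trace. This is an illustrative toy, not the paper's actual algorithm: the trace format, the `minimize_path` helper, and the dependency model (each statement annotated with the variables it defines and uses) are assumptions made for the example.

```python
# Hypothetical sketch of Trace Gadget-style minimization: given one execution
# path (a list of statements) and the index of a sink statement, keep only the
# statements the sink transitively depends on, dropping unrelated context.

def minimize_path(path, sink_index):
    """path: list of (defined_vars, used_vars, source_text) per statement."""
    needed = set(path[sink_index][1])          # variables the sink reads
    keep = [sink_index]
    for i in range(sink_index - 1, -1, -1):    # walk the trace backwards
        defs, uses, _ = path[i]
        if needed & set(defs):                 # statement defines a needed var
            keep.append(i)
            needed -= set(defs)                # that need is now satisfied...
            needed |= set(uses)                # ...but its inputs become needed
    return [path[i][2] for i in reversed(keep)]

# Example trace: only `q` feeds the vulnerable sink; the logging is noise.
trace = [
    (["user"], [],       'user = request.args["u"]'),
    (["log"],  [],       'log = open("app.log")'),
    (["q"],    ["user"], 'q = "SELECT * FROM t WHERE u=" + user'),
    ([],       ["log"],  'log.write("handled request")'),
    ([],       ["q"],    'cursor.execute(q)'),   # sink
]
print(minimize_path(trace, 4))
# → the two logging statements are removed; the source-to-sink chain remains
```

The resulting minimal-but-complete statement list is what would then be handed to the ML model as context.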