🤖 AI Summary
To address the problem that redundant code context degrades model comprehension in ML-based vulnerability detection, this paper proposes Trace Gadgets: a minimal code representation built on execution-path minimization that retains only the statements needed to reach the vulnerability, preserving path completeness while keeping the context concise. Key contributions include: (1) the first execution-path-minimization representation paradigm; (2) the first large-scale vulnerability dataset generated from real-world applications with manually curated labels; and (3) a pipeline combining program slicing, dynamic taint analysis, and lightweight pre-trained models (CodeBERT/GraphCodeBERT), backed by human-verified labeling. Experiments show that the approach improves F1-score by at least 4% over industry-standard static scanners (e.g., GitHub's CodeQL) on fully unseen projects and uncovers multiple previously unknown vulnerabilities in real-world software, several of which were assigned CVE identifiers.
📝 Abstract
As the number of web applications and API endpoints exposed to the Internet continues to grow, so does the number of exploitable vulnerabilities. Manually identifying such vulnerabilities is tedious. Meanwhile, static security scanners tend to produce many false positives. While machine learning-based approaches are promising, they typically perform well only in scenarios where training and test data are closely related. A key challenge for ML-based vulnerability detection is providing suitable and concise code context, as excessively long contexts negatively affect the code comprehension capabilities of machine learning models, particularly smaller ones. This work introduces Trace Gadgets, a novel code representation that minimizes code context by removing non-related code. Trace Gadgets precisely capture the statements that cover the path to the vulnerability. As input for ML models, Trace Gadgets provide a minimal but complete context, thereby improving the detection performance. Moreover, we collect a large-scale dataset generated from real-world applications with manually curated labels to further improve the performance of ML-based vulnerability detectors. Our results show that state-of-the-art machine learning models perform best when using Trace Gadgets compared to previous code representations, surpassing the detection capabilities of industry-standard static scanners such as GitHub's CodeQL by at least 4% on a fully unseen dataset. By applying our framework to real-world applications, we identify and report previously unknown vulnerabilities in widely deployed software.
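The core idea behind the representation — keeping only the statements on an execution path that the vulnerable sink transitively depends on — can be sketched as a backward slice over a recorded trace. This is an illustrative toy, not the paper's actual algorithm: the trace format, the `minimize_path` helper, and the dependency model (each statement annotated with the variables it defines and uses) are assumptions made for the example.

```python
# Hypothetical sketch of Trace Gadget-style minimization: given one execution
# path (a list of statements) and the index of a sink statement, keep only the
# statements the sink transitively depends on, dropping unrelated context.

def minimize_path(path, sink_index):
    """path: list of (defined_vars, used_vars, source_text) per statement."""
    needed = set(path[sink_index][1])          # variables the sink reads
    keep = [sink_index]
    for i in range(sink_index - 1, -1, -1):    # walk the trace backwards
        defs, uses, _ = path[i]
        if needed & set(defs):                 # statement defines a needed var
            keep.append(i)
            needed -= set(defs)                # that need is now satisfied...
            needed |= set(uses)                # ...but its inputs become needed
    return [path[i][2] for i in reversed(keep)]

# Example trace: only `q` feeds the vulnerable sink; the logging is noise.
trace = [
    (["user"], [],       'user = request.args["u"]'),
    (["log"],  [],       'log = open("app.log")'),
    (["q"],    ["user"], 'q = "SELECT * FROM t WHERE u=" + user'),
    ([],       ["log"],  'log.write("handled request")'),
    ([],       ["q"],    'cursor.execute(q)'),   # sink
]
print(minimize_path(trace, 4))
# → the two logging statements are removed; the source-to-sink chain remains
```

The resulting minimal-but-complete statement list is what would then be handed to the ML model as context.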