FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Software security vulnerability reports frequently lack reproducible proof-of-vulnerability (PoV) test cases, hindering patch validation and regression testing. To address this, we propose FaultLine, a hierarchical-reasoning LLM agent framework that generates PoV tests without relying on language-specific static or dynamic analysis tools. The approach combines source-to-sink data-flow tracing, branch-condition reasoning, and feedback-driven multi-round refinement to synthesize inputs that satisfy vulnerability-triggering conditions. The framework operates directly on source code and supports multiple programming languages, including Java, C, and C++. Evaluated on a benchmark of 100 real-world vulnerabilities, it generated PoV tests for 16 of them, a 77% relative improvement over CodeAct 2.1 (9 PoVs). All code and the benchmark dataset are publicly released.

📝 Abstract
Despite the critical threat posed by software security vulnerabilities, reports are often incomplete, lacking the proof-of-vulnerability (PoV) tests needed to validate fixes and prevent regressions. These tests are crucial not only for ensuring patches work, but also for helping developers understand how vulnerabilities can be exploited. Generating PoV tests is a challenging problem, requiring reasoning about the flow of control and data through deeply nested levels of a program. We present FaultLine, an LLM agent workflow that uses a set of carefully designed reasoning steps, inspired by aspects of traditional static and dynamic program analysis, to automatically generate PoV test cases. Given a software project with an accompanying vulnerability report, FaultLine 1) traces the flow of an input from an externally accessible API ("source") to the "sink" corresponding to the vulnerability, 2) reasons about the conditions that an input must satisfy in order to traverse the branch conditions encountered along the flow, and 3) uses this reasoning to generate a PoV test case in a feedback-driven loop. FaultLine does not use language-specific static or dynamic analysis components, which enables it to be used across programming languages. To evaluate FaultLine, we collate a challenging multi-lingual dataset of 100 known vulnerabilities in Java, C and C++ projects. On this dataset, FaultLine is able to generate PoV tests for 16 projects, compared to just 9 for CodeAct 2.1, a popular state-of-the-art open-source agentic framework. Thus, FaultLine represents a 77% relative improvement over the state of the art. Our findings suggest that hierarchical reasoning can enhance the performance of LLM agents on PoV test generation, but the problem in general remains challenging. We make our code and dataset publicly available in the hope that it will spur further research in this area.
Problem

Research questions and friction points this paper is trying to address.

Vulnerability reports often omit reproducible PoV tests, leaving fixes unvalidated
Generating a PoV requires tracing an input from source to sink through nested branch conditions
Existing approaches depend on language-specific static/dynamic analysis, limiting cross-language use
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agent workflow for PoV test generation
Multi-language vulnerability tracing without analysis components
Feedback-driven loop for conditional input reasoning