Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vulnerability detection in source code is challenging because benign and vulnerable functions are often highly similar. To address this, we propose VulTrial, a courtroom-inspired multi-agent framework that orchestrates four role-specific agents (Security Researcher, Code Author, Moderator, and Review Board) in an interpretable, iteration-controllable, interactive reasoning process. We further apply role-specific instruction tuning with a small dataset (50 sample pairs), improving performance under low-resource constraints. Experiments show that GPT-4o-based VulTrial outperforms single-agent and multi-agent baselines by 102.39% and 84.17%, respectively; after role-specific instruction tuning, the gains rise to 139.89% and 118.30%. Notably, applying VulTrial to the cost-effective GPT-3.5 improves performance by 69.89% over GPT-4o in a single-agent setting, at a lower overall cost. Core contributions include: (1) a courtroom-inspired paradigm for vulnerability reasoning; (2) role-specific instruction tuning on small data; and (3) an analysis of how the number of agent interactions affects performance and cost.
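The courtroom workflow described above can be sketched as a simple orchestration loop. Everything below is a minimal illustration: the role names come from the paper, but the agent functions are placeholder stubs standing in for LLM calls, not the authors' actual prompts or models.

```python
# Hedged sketch of a VulTrial-style courtroom loop. Each role is a stub;
# in the real system each would be an LLM agent with a role-specific prompt.

def security_researcher(code: str) -> str:
    # Stub accusation: flags one suspicious pattern as a stand-in for analysis.
    return "vulnerable" if "strcpy" in code else "benign"

def code_author(code: str, accusation: str) -> str:
    # Stub rebuttal: the author argues against the researcher's accusation.
    return "contest" if accusation == "benign" else "defend"

def moderator(accusation: str, rebuttal: str) -> dict:
    # Stub: summarizes both sides of the round for the review board.
    return {"accusation": accusation, "rebuttal": rebuttal}

def review_board(transcript: list) -> str:
    # Stub verdict: majority of the researcher's accusations across rounds.
    votes = [turn["accusation"] for turn in transcript]
    return max(set(votes), key=votes.count)

def vultrial(code: str, rounds: int = 2) -> str:
    """Iteration-controllable debate: researcher vs. author, moderated, then judged."""
    transcript = []
    for _ in range(rounds):
        accusation = security_researcher(code)
        rebuttal = code_author(code, accusation)
        transcript.append(moderator(accusation, rebuttal))
    return review_board(transcript)

print(vultrial("strcpy(buf, user_input);"))  # prints "vulnerable"
```

The `rounds` parameter mirrors the paper's analysis of increasing agent interactions: more rounds mean a longer transcript (and more token cost) before the review board renders a verdict.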

📝 Abstract
Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom-inspired multi-agent framework designed to enhance automated vulnerability detection. It employs four role-specific agents: a security researcher, a code author, a moderator, and a review board. Through extensive experiments using GPT-3.5 and GPT-4o, we demonstrate that VulTrial outperforms single-agent and multi-agent baselines. Using GPT-4o, VulTrial improves performance by 102.39% and 84.17% over the respective baselines. Additionally, we show that role-specific instruction tuning with small data (50 sample pairs) further improves VulTrial's performance, to gains of 139.89% and 118.30%. Furthermore, we analyze the impact of increasing the number of agent interactions on VulTrial's overall performance. While multi-agent setups inherently incur higher costs due to increased token usage, our findings reveal that applying VulTrial to a cost-effective model like GPT-3.5 can improve its performance by 69.89% compared to GPT-4o in a single-agent setting, at a lower overall cost.
Problem

Research questions and friction points this paper is trying to address.

Enhancing automated vulnerability detection in source code
Addressing challenges in distinguishing benign and vulnerable functions
Improving performance with multi-agent framework and role-specific tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Courtroom-inspired multi-agent framework for vulnerability detection
Four role-specific agents enhance detection accuracy
Role-specific instruction tuning improves performance significantly
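The role-specific instruction tuning above relies on a very small set of role–instruction pairs (50 in the paper). The sketch below shows one plausible way to assemble such a dataset in chat-style JSONL format; the record layout, prompt wording, and example content are illustrative assumptions, not the authors' actual data.

```python
import json

# Hedged sketch: building a small role-specific instruction-tuning set.
# The four roles come from the paper; everything else here is illustrative.

ROLES = ["security researcher", "code author", "moderator", "review board"]

def make_example(role: str, code: str, response: str) -> dict:
    # Chat-style record of the kind common instruction-tuning pipelines consume.
    return {
        "messages": [
            {"role": "system",
             "content": f"You act as the {role} in a code-review trial."},
            {"role": "user", "content": code},
            {"role": "assistant", "content": response},
        ]
    }

# One hypothetical pair; the paper's dataset would hold 50 such pairs
# distributed across the four roles.
examples = [
    make_example(
        "security researcher",
        "strcpy(buf, user_input);",
        "Unbounded copy into a fixed-size buffer: potential overflow (CWE-120).",
    ),
]

with open("role_tuning.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Tuning each agent on examples written for its own role, rather than one generic prompt, is what lets a small dataset specialize the four agents' behavior.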
Ratnadira Widyasari
Singapore Management University
Computer science
M. Weyssow
Singapore Management University
I. Irsan
Singapore Management University
Han Wei Ang
GovTech
Frank Liauw
Lead Cybersecurity Engineer, Government Technology Agency Singapore
Eng Lieh Ouh
Singapore Management University
Lwin Khin Shar
Singapore Management University
Hong Jin Kang
University of Sydney
Software Engineering · Specification Mining · Active Learning
David Lo
Singapore Management University