🤖 AI Summary
Vulnerability detection in source code faces significant challenges due to the high functional similarity between benign and vulnerable functions. To address this, we propose VulTrial—a courtroom-inspired multi-agent framework that orchestrates four role-specific agents (Security Researcher, Code Author, Moderator, and Review Board) in an interpretable, iteration-controllable, interactive reasoning process. We further apply role-specific instruction tuning with a small dataset (50 pair samples), substantially enhancing performance under low-resource constraints. Experiments show that GPT-4o-based VulTrial outperforms single-agent and multi-agent baselines by 102.39% and 84.17%, respectively; with role-specific instruction tuning, the gains rise to 139.89% and 118.30%. Notably, the GPT-3.5 variant outperforms the GPT-4o single-agent baseline by 69.89% while incurring a lower overall cost. Core contributions include: (1) a courtroom-inspired paradigm for vulnerability reasoning; (2) role-specific instruction tuning from small data; and (3) a cost-aware analysis of how the number of agent interactions affects multi-agent performance.
📝 Abstract
Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom-inspired multi-agent framework designed to enhance automated vulnerability detection. It employs four role-specific agents: a security researcher, a code author, a moderator, and a review board. Through extensive experiments using GPT-3.5 and GPT-4o, we demonstrate that VulTrial outperforms single-agent and multi-agent baselines. Using GPT-4o, VulTrial improves performance by 102.39% and 84.17% over the respective baselines. Additionally, we show that role-specific instruction tuning of the multi-agent setup with small data (50 pair samples) further improves VulTrial's gains to 139.89% and 118.30%. Furthermore, we analyze the impact of increasing the number of agent interactions on VulTrial's overall performance. While multi-agent setups inherently incur higher costs due to increased token usage, our findings show that applying VulTrial to a cost-effective model such as GPT-3.5 improves performance by 69.89% over GPT-4o in a single-agent setting, at a lower overall cost.
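The courtroom workflow described above can be sketched as a simple orchestration loop: the security researcher "prosecutes" the code, the code author defends it, the moderator summarizes each exchange, and after a fixed number of rounds the review board issues a verdict. The sketch below is illustrative only and is not the paper's actual implementation; the `chat` helper is a hypothetical stand-in for a real LLM call (e.g. to GPT-3.5 or GPT-4o), and all prompts and canned replies are assumptions for demonstration.

```python
def chat(role: str, prompt: str) -> str:
    """Hypothetical LLM call. In a real system this would query a model
    such as GPT-4o with a role-specific system prompt; here we return
    deterministic canned replies so the sketch runs without API access."""
    canned = {
        "security_researcher": "Possible out-of-bounds read on line 3.",
        "code_author": "The index is bounds-checked before use.",
        "moderator": "Researcher alleges an OOB read; author cites a bounds check.",
        "review_board": "BENIGN",
    }
    return canned[role]


def vultrial(code: str, rounds: int = 2) -> str:
    """Run `rounds` accusation/defense exchanges, then ask the review
    board for a final verdict over the moderator's transcript."""
    transcript = []
    for _ in range(rounds):
        # Security researcher argues the function is vulnerable.
        accusation = chat("security_researcher", code)
        # Code author rebuts the accusation in defense of the code.
        defense = chat("code_author", code + "\n" + accusation)
        # Moderator condenses both sides for the record.
        summary = chat("moderator", accusation + "\n" + defense)
        transcript.append(summary)
    # Review board weighs the full transcript and decides.
    return chat("review_board", "\n".join(transcript))


verdict = vultrial("int get(int *a, int i) { return a[i]; }")
print(verdict)
```

Keeping the round count as an explicit parameter mirrors the paper's analysis of how the number of agent interactions trades off accuracy against token cost.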