🤖 AI Summary
Existing static analysis tools and LLM-based approaches suffer from limited coverage, poor adaptability to diverse bug types, and difficulty identifying complex defects. To address these challenges, this paper proposes BugScope, an LLM-driven multi-agent defect detection framework inspired by human code auditing practices. BugScope emulates how auditors infer bug patterns from both buggy and non-buggy examples, combining program slicing, retrieval-augmented generation (RAG), and domain-specific prompt engineering to construct context-aware retrieval strategies and reasoning prompts. The framework enables cross-project, context-sensitive defect detection. Evaluated on 40 real-world bugs from 21 open-source projects, it achieves 87.04% precision and 90.00% recall, outperforming state-of-the-art industrial tools by 0.44 in F1 score. BugScope also identified 141 previously unknown bugs in large-scale systems such as the Linux kernel; 78 have been fixed and 7 confirmed by developers.
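The summary reports precision and recall but not the F1 value itself. As a quick check, F1 is the harmonic mean of the two, so it can be recovered from the reported figures. The function name `f1_score` below is just an illustrative helper, not something taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported figures, converted from percentages to fractions.
bugscope_f1 = f1_score(0.8704, 0.9000)
print(f"BugScope F1 is roughly {bugscope_f1:.3f}")
```

With these inputs the F1 score comes out at roughly 0.885.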
📝 Abstract
Detecting software bugs remains a fundamental challenge due to the extensive diversity of real-world defects. Traditional static analysis tools often rely on symbolic workflows, which restrict their coverage and hinder adaptability to customized bugs with diverse anti-patterns. While recent advances incorporate large language models (LLMs) to enhance bug detection, these methods continue to struggle with sophisticated bugs and typically operate within limited analysis contexts. To address these challenges, we propose BugScope, an LLM-driven multi-agent system that emulates how human auditors learn new bug patterns from representative examples and apply that knowledge during code auditing. Given a set of examples illustrating both buggy and non-buggy behaviors, BugScope synthesizes a retrieval strategy to extract relevant detection contexts via program slicing and then constructs a tailored detection prompt to guide accurate reasoning by the LLM. Our evaluation on a curated dataset of 40 real-world bugs drawn from 21 widely used open-source projects demonstrates that BugScope achieves 87.04% precision and 90.00% recall, surpassing state-of-the-art industrial tools by 0.44 in F1 score. Further testing on large-scale open-source systems, including the Linux kernel, uncovered 141 previously unknown bugs, of which 78 have been fixed and 7 confirmed by developers, highlighting BugScope's substantial practical impact.
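The workflow in the abstract (labeled examples in, a narrowed code context retrieved, a tailored prompt out) can be pictured with a minimal sketch. Everything here is hypothetical and is not BugScope's implementation: the names `BugExample`, `slice_context`, and `build_detection_prompt`, the prompt wording, and the sample C snippets are all invented for illustration. In particular, `slice_context` is a crude keyword filter standing in for real program slicing, and the retrieval target (`"buf"`) is hard-coded, whereas the abstract describes the retrieval strategy as synthesized from the examples. No LLM is called.

```python
from dataclasses import dataclass


@dataclass
class BugExample:
    """Hypothetical container for one labeled example."""
    code: str
    is_buggy: bool


def slice_context(source: str, variable: str) -> str:
    """Crude stand-in for program slicing: keep only lines mentioning `variable`.

    A real slicer follows data and control dependencies; this filter only
    illustrates narrowing the context handed to the model.
    """
    kept = [line for line in source.splitlines() if variable in line]
    return "\n".join(kept)


def build_detection_prompt(pattern: str, examples: list[BugExample], context: str) -> str:
    """Assemble a detection prompt from labeled examples plus retrieved context."""
    parts = [f"Bug pattern under audit: {pattern}", ""]
    for ex in examples:
        label = "BUGGY" if ex.is_buggy else "NOT BUGGY"
        parts.append(f"[{label}]\n{ex.code}\n")
    parts.append("Code to audit:")
    parts.append(context)
    parts.append("Answer BUGGY or NOT BUGGY and explain briefly.")
    return "\n".join(parts)


# Invented examples of one pattern: a buggy and a non-buggy variant.
examples = [
    BugExample("p = malloc(n);\n*p = 0;", is_buggy=True),
    BugExample("p = malloc(n);\nif (!p) return -1;\n*p = 0;", is_buggy=False),
]
target = "int n = len;\nbuf = malloc(n);\ncount++;\nbuf[0] = 1;"
prompt = build_detection_prompt(
    "missing NULL check after allocation",
    examples,
    slice_context(target, "buf"),
)
print(prompt)
```

The point of the sketch is the division of labor: the examples define what counts as the bug, the retrieval step decides which code the model sees, and the prompt ties the two together.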