🤖 AI Summary
Current large language models (LLMs) perform poorly on cybersecurity tasks: they over-predict vulnerabilities in white-box detection and cover little of the ground truth in black-box web application testing. This work evaluates them with a dual-mode benchmark that combines function-level vulnerability detection on VulnLLM-R with black-box testing of production-style web applications. The tested setups include external tools such as Playwright MCP and Burp Suite MCP, as well as domain-specialized agents that encode a structured penetration-testing methodology. The results indicate that general-purpose frontier LLMs are not yet ready for real-world security testing, whereas a domain-specialized defense model achieves the highest precision (0.904) and the lowest false positive rate (9.7%) among the models tested, on a single GPU. The authors identify the lack of structured security-testing traces as the main training-data bottleneck, propose self-play security testing to generate such data, and argue for vertical foundation models purpose-built for cybersecurity.
📝 Abstract
We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
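For readers less familiar with the metrics quoted above, the following is a minimal sketch of how precision, false positive rate, and ground-truth coverage are conventionally computed. The abstract does not spell out its exact definitions, so the standard confusion-matrix formulas are an assumption here, and the function names and the example counts are hypothetical illustrations, not results from the paper. Only the total of 118 ground-truth vulnerabilities is taken from the abstract.

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported vulnerabilities that are real: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of benign samples wrongly flagged: FP / (FP + TN).

    Assumption: the standard definition; the paper may define it differently.
    """
    return fp / (fp + tn) if (fp + tn) else 0.0


def coverage(found: int, total: int) -> float:
    """Share of ground-truth vulnerabilities that were found."""
    return found / total if total else 0.0


TOTAL_GROUND_TRUTH = 118  # black-box ground-truth count stated in the abstract

if __name__ == "__main__":
    # Hypothetical confusion counts, chosen only to illustrate the formulas.
    tp, fp, tn = 90, 10, 90
    print(f"precision = {precision(tp, fp):.3f}")
    print(f"false positive rate = {false_positive_rate(fp, tn):.3f}")

    # Translate the coverage percentages quoted in the abstract into
    # approximate numbers of vulnerabilities out of 118.
    for pct in (0.04, 0.08, 0.10, 0.19):
        print(f"{pct:.0%} coverage is about {pct * TOTAL_GROUND_TRUTH:.1f} of {TOTAL_GROUND_TRUTH}")
```

Read this way, 4-8% coverage corresponds to roughly 5 to 9 of the 118 ground-truth vulnerabilities, and 10-19% to roughly 12 to 22.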