Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Frontier large language models (LLMs) underperform on cybersecurity tasks: they over-predict vulnerabilities in white-box detection and cover little of the ground truth in black-box web application testing. This work evaluates them with a dual-mode benchmark that pairs function-level vulnerability detection (VulnLLM-R) with security testing of five production-style web applications. Giving frontier models external tools such as Playwright MCP and Burp Suite MCP yields only modest gains, whereas domain-specialized agents that encode a structured penetration-testing methodology raise per-family detection above 50%. A domain-specialized defense model achieves the highest precision (0.904) and the lowest false positive rate (9.7%) among the models tested, on a single GPU. The authors identify the lack of structured security testing traces as the main training data bottleneck, propose self-play security testing as a data generation strategy, and argue for vertically specialized foundation models for cybersecurity.

📝 Abstract

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
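The abstract reports precision, false positive rate, and ground-truth coverage without defining them. The sketch below shows how such metrics are conventionally computed, assuming the standard definitions (precision = TP / (TP + FP), false positive rate = FP / (FP + TN), coverage = share of ground-truth vulnerabilities that were found). The paper may define them differently. All names (`ConfusionCounts`, `ground_truth_coverage`, the vulnerability IDs) and all counts are hypothetical illustrations, not the paper's code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical tally of white-box predictions over labeled functions."""
    tp: int  # vulnerable functions flagged as vulnerable
    fp: int  # safe functions flagged as vulnerable
    tn: int  # safe functions flagged as safe
    fn: int  # vulnerable functions flagged as safe


def precision(c: ConfusionCounts) -> float:
    """Fraction of flagged functions that are truly vulnerable."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def false_positive_rate(c: ConfusionCounts) -> float:
    """Fraction of safe functions that were wrongly flagged."""
    safe = c.fp + c.tn
    return c.fp / safe if safe else 0.0


def ground_truth_coverage(found_ids, ground_truth_ids) -> float:
    """Fraction of known vulnerabilities that a black-box run reported."""
    truth = set(ground_truth_ids)
    if not truth:
        return 0.0
    return len(set(found_ids) & truth) / len(truth)


# Illustrative numbers only; they are not results from the paper.
counts = ConfusionCounts(tp=50, fp=10, tn=90, fn=20)
print("precision:", precision(counts))
print("false positive rate:", false_positive_rate(counts))

# Findings outside the ground truth ("extra-1") do not raise coverage.
found = {"vuln-a", "vuln-b", "extra-1"}
truth = {"vuln-a", "vuln-b", "vuln-c", "vuln-d"}
print("coverage:", ground_truth_coverage(found, truth))
```

Under these definitions a model that flags almost everything can still reach high coverage, which is why the abstract reports false positive rate alongside the detection figures.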

Problem

Research questions and friction points this paper is trying to address.

cybersecurity

large language models

vulnerability detection

foundation models

security testing

Innovation

Methods, ideas, or system contributions that make the work stand out.

vertical foundation models

dual-mode vulnerability benchmark

structured penetration testing