Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior evaluations of AI agents in red-teaming have been limited to synthetic or lab environments, lacking rigorous comparison against human experts on real-world enterprise networks. Method: This work introduces ARTEMIS, a multi-agent framework for automated penetration testing, featuring dynamic prompt generation, sub-agent coordination, automated vulnerability validation and triaging, parallel exploit orchestration, and cost-aware scheduling. It is evaluated on a production university enterprise network comprising roughly 8,000 hosts across 12 subnets. Contribution/Results: In benchmarking against ten human red-team experts and six AI agents, ARTEMIS identified nine validated vulnerabilities with an 82% report validity rate, ranking second overall and outperforming nine of the ten human experts. It achieved a per-hour operational cost of $18, a 70% reduction from the human average of $60. This represents the first demonstration of AI agents surpassing the majority of professional red-teamers in end-to-end performance on a real enterprise network, though limitations persist in false-positive rates and GUI interaction capability.

📝 Abstract
We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering nine valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost: certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents versus human professionals in real-world penetration testing
Assessing AI performance in discovering vulnerabilities on a large network
Identifying AI strengths in systematic tasks and cost efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework with dynamic prompt generation
Automatic vulnerability triaging for efficient scanning
Cost-effective parallel exploitation at $18/hour
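The paper does not publish ARTEMIS's internals, but the combination described above (parallel exploit orchestration under a cost budget) can be illustrated with a minimal, hypothetical sketch. Everything here — the `Task` type, the greedy cheapest-first admission policy, and the dollar costing — is an illustrative assumption, not the paper's actual design.

```python
# Hypothetical sketch of cost-aware parallel task scheduling, loosely
# inspired by the description of "parallel exploit orchestration" and
# "cost-aware scheduling". All names and the costing model are
# illustrative assumptions, not ARTEMIS's real implementation.
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    est_cost: float  # assumed per-task dollar cost estimate

def schedule(tasks, budget, run, max_workers=4):
    """Greedily admit the cheapest tasks until the budget is exhausted,
    then execute the admitted set in parallel."""
    admitted, spent = [], 0.0
    for t in sorted(tasks, key=lambda t: t.est_cost):
        if spent + t.est_cost <= budget:
            admitted.append(t)
            spent += t.est_cost
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run, t): t for t in admitted}
        for fut in as_completed(futures):
            results[futures[fut].name] = fut.result()
    return results, spent

# Toy usage: with a $5 budget, only the two cheapest tasks are admitted.
tasks = [Task("scan-subnet-3", 2.0),
         Task("bruteforce-ssh", 9.0),
         Task("probe-web", 1.5)]
results, spent = schedule(tasks, budget=5.0, run=lambda t: f"done:{t.name}")
```

A real system would presumably refine cost estimates as tasks run and re-plan, rather than committing to a fixed admitted set up front.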