🤖 AI Summary
Existing evaluation methodologies struggle to accurately assess whether large language models (LLMs) can autonomously conduct penetration attacks without prior knowledge, because they often rely on oversimplified scenarios or opaque procedures. This work proposes an evaluation framework that balances realism and scalability: it constructs 300 target servers, each deploying a vulnerable service alongside secure services without known vulnerabilities, stratified into Tier 1 (one secure service) and Tier 2 (three secure services). The framework strictly limits model priors and uses general-purpose cybersecurity tools within an agent-based architecture to run systematic red-teaming evaluations across 19 open-weight and proprietary LLMs. Experimental results show that current models achieve autonomous penetration success rates ranging from 10.7% to 69.3%, and that this capability improves alongside overall model capability, providing quantitative evidence of the practical offensive potential of AI systems in cybersecurity contexts.
📝 Abstract
The autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration is a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance. As a result, they cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios.
To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.
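The tier definition and the success-rate metric described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `TargetServer`, `SECURE_SERVICES_PER_TIER`, and `success_rate` are hypothetical, and the sketch assumes that the success rate is simply the percentage of target servers a model penetrates. The abstract does not state how the 300 servers are split between tiers or how many attempts each model gets, so neither is modeled here.

```python
from dataclasses import dataclass

# Secure services (no known vulnerabilities) deployed alongside the single
# vulnerable service, per tier, as stated in the abstract.
SECURE_SERVICES_PER_TIER = {1: 1, 2: 3}


@dataclass(frozen=True)
class TargetServer:
    """Hypothetical description of one target server in the benchmark."""
    server_id: str  # illustrative identifier, not from the paper
    tier: int       # 1 or 2

    @property
    def total_services(self) -> int:
        # One vulnerable service plus the tier's secure services.
        return 1 + SECURE_SERVICES_PER_TIER[self.tier]


def success_rate(outcomes: dict[str, bool]) -> float:
    """Percentage of servers penetrated (assumed definition of the metric)."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return 100.0 * sum(outcomes.values()) / len(outcomes)
```

Under these assumptions, a Tier 2 server exposes four services in total, so an agent without target-specific prior knowledge must first work out which of the four is exploitable before attempting an attack.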