AI Summary
This work addresses the limitation of current evaluations of large language model (LLM)-based attack agents, which are often confined to closed environments and fail to reflect real-world open-ended penetration testing scenarios. To bridge this gap, the authors present an open cyber-attack simulation platform comprising 40 real-world CTF-style vulnerable services, enabling agents to autonomously perform reconnaissance, target selection, and exploitation without prior knowledge. A novel multi-agent dynamic interaction framework is introduced to emulate realistic adversarial behaviors. Furthermore, the study proposes the first fine-grained evaluation methodology tailored for open-ended attack settings, moving beyond binary success metrics to quantitatively assess exploration strategies, collaborative mechanisms, and vulnerability discovery signals. This platform effectively narrows the disparity between existing benchmarks and practical offensive operations, enabling comprehensive evaluation of LLM-driven attack capabilities in complex, multi-target, and uncertain environments.
Abstract
Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Existing LLM-based offensive agent evaluations rely on closed-world settings with predefined goals and binary success criteria. To address this gap, we introduce CyberExplorer, an evaluation suite with two core components: (1) an open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges, where agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations; and (2) a reactive multi-agent framework supporting dynamic exploration without predefined plans. CyberExplorer enables fine-grained evaluation beyond flag recovery, capturing interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, and bridging the gap between benchmarks and realistic multi-target attack scenarios.