CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

2026-02-08
AI Summary
This work addresses the limitation of current evaluations of large language model (LLM)-based attack agents, which are often confined to closed environments and fail to reflect real-world open-ended penetration testing scenarios. To bridge this gap, the authors present an open cyber-attack simulation platform comprising 40 real-world CTF-style vulnerable services, enabling agents to autonomously perform reconnaissance, target selection, and exploitation without prior knowledge. A novel multi-agent dynamic interaction framework is introduced to emulate realistic adversarial behaviors. Furthermore, the study proposes the first fine-grained evaluation methodology tailored for open-ended attack settings, moving beyond binary success metrics to quantitatively assess exploration strategies, collaborative mechanisms, and vulnerability discovery signals. This platform effectively narrows the disparity between existing benchmarks and practical offensive operations, enabling comprehensive evaluation of LLM-driven attack capabilities in complex, multi-target, and uncertain environments.

Abstract
Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Existing LLM-based offensive agent evaluations rely on closed-world settings with predefined goals and binary success criteria. To address this gap, we introduce CyberExplorer, an evaluation suite with two core components: (1) an open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges, where agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations; and (2) a reactive multi-agent framework supporting dynamic exploration without predefined plans. CyberExplorer enables fine-grained evaluation beyond flag recovery, capturing interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, and thereby bridging the gap between benchmarks and realistic multi-target attack scenarios.
Problem

Research questions and friction points this paper is trying to address.

offensive security
LLM evaluation
open-world benchmark
realistic attack simulation
autonomous exploitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

open-environment benchmark
LLM offensive security
reactive multi-agent framework
autonomous vulnerability discovery
realistic attack simulation
Nanda Rani
CISPA - Helmholtz Center for Information Security, Saarbrücken, Germany
Kimberly Milner
New York University, New York, USA
Minghao Shao
New York University, New York, USA
Meet Udeshi
New York University, New York, USA
Haoran Xi
New York University, New York, USA
Venkata Sai Charan Putrevu
New York University, New York, USA
Saksham Aggarwal
New York University, New York, USA
Sandeep K. Shukla
International Institute of Information Technology Hyderabad, Hyderabad, India
Prashanth Krishnamurthy
Research Scientist, New York University
robotics, control systems, cyber-physical systems
Farshad Khorrami
Professor of Electrical and Computer Engineering, NYU
Robotics, Control Systems, Cyber Physical System Security, Decentralized Control
Muhammad Shafique
Professor, ECE, New York University (AD-UAE, Tandon-USA), Director eBRAIN Lab
Embedded Machine Learning, Brain-Inspired Computing, Robust & Energy-Efficient System Design, Smart
Ramesh Karri
New York University, New York, USA