From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the effectiveness and reliability of large language model (LLM)-based agents on realistic, multi-stage penetration testing tasks. We identify critical limitations in existing LLM agents, including context fragmentation across complex attack chains, rigid planning, and weak error recovery, and propose five core functional enhancements: global context memory, inter-agent message passing, context-conditioned tool invocation, adaptive planning, and real-time monitoring. Grounded in a modular agent architecture and a goal-oriented capability augmentation paradigm, we design an empirical framework supporting dynamic response, robust error recovery, and multi-dimensional evaluation. Experiments demonstrate that our approach significantly improves success rate (+32.7%) and stability (a 58.4% reduction in failure rate) on multi-step, real-time, highly adversarial penetration tasks, outperforming monolithic agent designs. Our results elucidate the critical pathway from functional capabilities to operational performance in security-critical LLM agent deployment.
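As a rough, hypothetical illustration of how these five enhancements could be attached to a modular agent (the class and method names below are our own assumptions, not the paper's published interfaces), consider the following sketch:

```python
# Minimal, hypothetical sketch (not the paper's code) of a modular pentest agent
# wired with the five augmentations; all names and interfaces are illustrative.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GlobalContextMemory:
    """GCM: retains findings (hosts, services, credentials) across attack phases."""
    facts: List[str] = field(default_factory=list)

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def summary(self) -> str:
        return "\n".join(self.facts)


@dataclass
class ModularPentestAgent:
    llm: Callable[[str], str]                 # any prompt -> completion callable
    tools: Dict[str, Callable[[str], str]]    # e.g. {"nmap": run_nmap, ...}
    memory: GlobalContextMemory = field(default_factory=GlobalContextMemory)

    def step(self, objective: str) -> str:
        # Adaptive Planning (AP): re-plan from the objective plus the global context.
        plan = self.llm(
            f"Objective: {objective}\n"
            f"Known so far:\n{self.memory.summary()}\n"
            f"Reply as 'tool: argument' using one of {list(self.tools)}."
        )
        tool_name, _, arg = plan.partition(":")
        tool_name, arg = tool_name.strip(), arg.strip()

        # Context-Conditioned Invocation (CCI): only execute tools the plan names.
        if tool_name not in self.tools:
            self.memory.add(f"planner proposed unknown tool '{tool_name}'")
            return f"skipped invalid tool '{tool_name}'"

        output = self.tools[tool_name](arg)

        # Real-Time Monitoring (RTM): flag failures so the next planning step can
        # recover instead of repeating the same call.
        status = "FAILED" if "error" in output.lower() else "ok"
        self.memory.add(f"{tool_name}({arg}) -> {status}: {output[:200]}")
        return output
```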

📝 Abstract
Large language models (LLMs) are increasingly used to automate or augment penetration testing, but their effectiveness and reliability across attack phases remain unclear. We present a comprehensive evaluation of multiple LLM-based agents, from single-agent to modular designs, across realistic penetration testing scenarios, measuring empirical performance and recurring failure patterns. We also isolate the impact of five core functional capabilities via targeted augmentations: Global Context Memory (GCM), Inter-Agent Messaging (IAM), Context-Conditioned Invocation (CCI), Adaptive Planning (AP), and Real-Time Monitoring (RTM). These interventions support, respectively: (i) context coherence and retention, (ii) inter-component coordination and state management, (iii) tool use accuracy and selective execution, (iv) multi-step strategic planning, error detection, and recovery, and (v) real-time dynamic responsiveness. Our results show that while some architectures natively exhibit subsets of these properties, targeted augmentations substantially improve modular agent performance, especially in complex, multi-step, and real-time penetration testing tasks.
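For the inter-component coordination and state sharing that IAM targets, a minimal publish/subscribe sketch (hypothetical; the paper does not specify this interface) could look like:

```python
# Minimal illustrative message bus for Inter-Agent Messaging (IAM); hypothetical design.
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class MessageBus:
    """Lets specialised agents (recon, exploitation, reporting) share state by topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)


# Example: a recon agent hands open-port findings to an exploitation agent.
bus = MessageBus()
bus.subscribe("recon.open_ports", lambda ports: print("exploitation agent received", ports))
bus.publish("recon.open_ports", {"10.0.0.5": [22, 80, 445]})
```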
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM effectiveness in penetration testing phases
Assessing impact of five core functional capabilities on performance
Measuring empirical performance and failure patterns in realistic scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inter-Agent Messaging for coordination
Context-Conditioned Invocation for tool accuracy (see the sketch after this list)
Real-Time Monitoring for dynamic responsiveness
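To make the context-conditioned invocation idea concrete, here is a small, assumed sketch (not the authors' implementation) in which a tool call is gated on facts already established in the agent's context:

```python
# Hypothetical precondition gating for Context-Conditioned Invocation (CCI):
# a tool runs only once the facts it depends on appear in the shared context.
from typing import Callable, Dict, Set

TOOL_PRECONDITIONS: Dict[str, Set[str]] = {
    "run_port_scan": set(),                          # no prerequisites
    "run_exploit": {"open_port", "service_version"},
    "dump_credentials": {"shell_access"},
}


def invoke_if_ready(tool: str, context_facts: Set[str],
                    tools: Dict[str, Callable[[], str]]) -> str:
    missing = TOOL_PRECONDITIONS.get(tool, set()) - context_facts
    if missing:
        # Defer and report what is still needed, rather than burning a step
        # on an invocation that cannot succeed yet.
        return f"deferred {tool}: missing {sorted(missing)}"
    return tools[tool]()


# Example: exploitation is deferred until recon has established its prerequisites.
tools = {"run_exploit": lambda: "exploit launched"}
print(invoke_if_ready("run_exploit", {"open_port"}, tools))                     # deferred
print(invoke_if_ready("run_exploit", {"open_port", "service_version"}, tools))  # runs
```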
👥 Authors
Lanxiao Huang
National Security Institute, Virginia Tech
Daksh Dave
Department of Electrical and Computer Engineering, Virginia Tech
Ming Jin
Department of Electrical and Computer Engineering, Virginia Tech
Tyler Cody
Associate Professor, University of Virginia
Systems Theory, Learning Theory, Machine Learning, Systems Engineering
Peter Beling
National Security Institute, Virginia Tech