🤖 AI Summary
This work systematically evaluates the effectiveness and reliability of large language model (LLM)-based agents on realistic, multi-stage penetration testing tasks. We identify critical limitations in existing LLM agents—including context fragmentation across complex attack chains, rigid planning, and weak error recovery—and propose five core functional enhancements: global context memory, inter-agent message passing, context-conditioned tool invocation, adaptive planning, and real-time monitoring. Grounded in a modular agent architecture and a goal-oriented capability-augmentation paradigm, we design an empirical framework supporting dynamic response, robust error recovery, and multi-dimensional evaluation. Experiments demonstrate that our approach significantly improves success rate (+32.7%) and stability (58.4% reduction in failure rate) on multi-step, real-time, highly adversarial penetration testing tasks, outperforming monolithic agent designs. Our results elucidate the critical pathway from functional capabilities to operational performance in security-critical LLM agent deployment.
📝 Abstract
Large language models (LLMs) are increasingly used to automate or augment penetration testing, but their effectiveness and reliability across attack phases remain unclear. We present a comprehensive evaluation of multiple LLM-based agents, from single-agent to modular designs, across realistic penetration testing scenarios, measuring empirical performance and characterizing recurring failure patterns. We also isolate the impact of five core functional capabilities via targeted augmentations: Global Context Memory (GCM), Inter-Agent Messaging (IAM), Context-Conditioned Invocation (CCI), Adaptive Planning (AP), and Real-Time Monitoring (RTM). These interventions support, respectively: (i) context coherence and retention, (ii) inter-component coordination and state management, (iii) tool-use accuracy and selective execution, (iv) multi-step strategic planning, error detection, and recovery, and (v) real-time dynamic responsiveness. Our results show that while some architectures natively exhibit subsets of these properties, targeted augmentations substantially improve modular agent performance, especially on complex, multi-step, and real-time penetration testing tasks.
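To make the five augmentations concrete, a minimal sketch of how GCM, IAM, CCI, and AP might compose in a modular agent is shown below. This is purely illustrative: the abstract does not specify any implementation, so every class, function, and field name here (`GlobalContextMemory`, `MessageBus`, `invoke_tool`, `adapt_plan`, the `requires` precondition) is a hypothetical stand-in, not the authors' actual design.

```python
from dataclasses import dataclass, field

@dataclass
class GlobalContextMemory:
    """GCM (illustrative): findings shared across attack phases."""
    facts: dict = field(default_factory=dict)

    def record(self, key, value):
        self.facts[key] = value

    def recall(self, key):
        return self.facts.get(key)

@dataclass
class MessageBus:
    """IAM (illustrative): a topic-tagged queue between agent modules."""
    queue: list = field(default_factory=list)

    def send(self, sender, topic, payload):
        self.queue.append((sender, topic, payload))

    def drain(self, topic):
        msgs = [m for m in self.queue if m[1] == topic]
        self.queue = [m for m in self.queue if m[1] != topic]
        return msgs

def invoke_tool(tool, memory):
    """CCI (illustrative): run a tool only if its context precondition
    (a key the tool declares it `requires`) is already in memory."""
    if tool["requires"] and memory.recall(tool["requires"]) is None:
        return ("skipped", tool["name"])
    return ("ran", tool["name"])

def adapt_plan(plan, failed_step):
    """AP (illustrative): on failure, defer the failed step for a retry
    instead of aborting the whole attack chain."""
    remaining = [s for s in plan if s != failed_step]
    return remaining + [failed_step]
```

A usage pass: a recon module records open ports into GCM and announces them over IAM; CCI then lets `http_enum` run (its precondition is met) while skipping `smb_enum` (no SMB hosts recorded); after an exploit step fails, AP reorders the plan to retry it last. RTM would sit alongside this loop as an event log on the bus, omitted here for brevity.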