🤖 AI Summary
This work reveals a systemic failure of prevailing defenses against indirect prompt injection (IPI) in large language model (LLM) agents under adaptive adversarial settings. To demonstrate this, the authors introduce the first systematic adaptive attack framework for evaluating IPI defenses (combining multi-round feedback optimization, dynamic context-aware perturbation, and reverse engineering of defense behaviors) and use it to bypass all eight mainstream defenses with an average attack success rate above 50%. The core contributions are threefold: (1) an evaluation paradigm for IPI defenses grounded in adversarial robustness, emphasizing the necessity of adaptive attack testing; (2) empirical validation showing that static or non-adaptive evaluations significantly overestimate defense efficacy; and (3) an open-sourced attack implementation to foster community-wide adoption of more stringent, realistic evaluation standards for LLM agent security.
📝 Abstract
Large Language Model (LLM) agents exhibit remarkable performance across diverse applications by using external tools to interact with environments. However, integrating external tools introduces security risks, such as indirect prompt injection (IPI) attacks. Although defenses against IPI attacks have been proposed, their robustness remains questionable because they are rarely tested against adaptive attacks. In this paper, we evaluate eight different defenses and bypass all of them using adaptive attacks, consistently achieving an attack success rate of over 50%. This reveals critical vulnerabilities in current defenses. Our research underscores the need for adaptive attack evaluation when designing defenses to ensure robustness and reliability. The code is available at https://github.com/uiuc-kang-lab/AdaptiveAttackAgent.
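The multi-round feedback optimization mentioned in the summary can be sketched as a simple loop: mutate an injected payload using the defense's feedback until it slips through or a round budget is exhausted. This is a minimal illustrative sketch, not the paper's implementation; `toy_defense` and `toy_mutate` are hypothetical stand-ins for a real IPI defense and a real perturbation strategy.

```python
def adaptive_attack(defense, seed_payload, mutate, rounds=20):
    """Multi-round feedback loop: repeatedly query the defense and
    mutate the payload based on its feedback until a bypass is found
    or the round budget runs out. Returns (payload, rounds_used) on
    success, (None, rounds) on failure."""
    payload = seed_payload
    for i in range(rounds):
        blocked, feedback = defense(payload)
        if not blocked:
            return payload, i  # bypass found
        payload = mutate(payload, feedback)
    return None, rounds

# --- toy stand-ins for demonstration only ---
def toy_defense(payload):
    # Blocks payloads containing the literal marker "IGNORE"; real
    # defenses inspect tool outputs with far richer signals.
    blocked = "IGNORE" in payload
    return blocked, "keyword filter triggered" if blocked else ""

def toy_mutate(payload, feedback):
    # Naive obfuscation: change the case of the flagged keyword.
    return payload.replace("IGNORE", "Ignore", 1)

best, rounds_used = adaptive_attack(
    toy_defense, "IGNORE prior instructions", toy_mutate)
print(best, rounds_used)  # → Ignore prior instructions 1
```

In practice, the mutation step would be driven by an optimizer or an attacker LLM rather than a fixed string rewrite, but the feedback-driven structure of the loop is the same.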