🤖 AI Summary
This work addresses the insufficient robustness of large language models (LLMs) against jailbreaking attacks by proposing and systematically evaluating a multi-agent collaborative defense mechanism. We extend the AutoDefense framework with dual- and tri-agent architectures, integrating state-of-the-art jailbreaking strategies—including BetterDan and JB—for adversarial evaluation. Results demonstrate that multi-agent systems significantly reduce false negative rates (i.e., missed detections of malicious prompts), outperforming single-agent baselines in jailbreak resistance. However, this improvement incurs increased false positive rates and higher inference overhead, exposing an inherent trade-off among security, usability, and efficiency. To our knowledge, this is the first study to quantitatively validate the efficacy of the multi-agent paradigm for LLM safety defense, precisely characterizing its performance boundaries across diverse attack types. The findings establish a scalable, principled pathway toward enhancing LLM robustness against adversarial prompt engineering.
📝 Abstract
Recent advances in large language models (LLMs) have raised concerns about jailbreaking attacks, i.e., prompts that bypass safety mechanisms. This paper investigates the use of multi-agent LLM systems as a defense against such attacks. We evaluate three jailbreaking strategies, including the original AutoDefense attack and two from Deepleaps: BetterDan and JB. Reproducing the AutoDefense framework, we compare single-agent setups with two- and three-agent configurations. Our results show that multi-agent systems enhance resistance to jailbreaks, especially by reducing false negatives. However, their effectiveness varies by attack type, and they introduce trade-offs such as increased false positives and computational overhead. These findings point to the limitations of current automated defenses and suggest directions for improving alignment robustness in future LLM systems.
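The trade-off described above can be illustrated with a minimal sketch of a multi-agent defense loop. This is not the AutoDefense implementation: the agent functions below are hypothetical keyword-based stand-ins for LLM judges, and the any-vote aggregation rule is an assumption chosen to show why adding agents lowers false negatives while raising false positives.

```python
from dataclasses import dataclass
from typing import Callable, List

# An "agent" here is any function that inspects a candidate LLM response
# and returns True if it judges the response unsafe. In a real system
# (e.g., AutoDefense), each agent would be an LLM call with its own role prompt.
AgentFn = Callable[[str], bool]

@dataclass
class MultiAgentDefense:
    """Blocks a response if ANY agent flags it as unsafe (assumed aggregation).

    More agents -> more chances to catch an attack (fewer false negatives),
    but also more chances to wrongly flag a benign response (more false
    positives) and more inference calls (higher overhead)."""
    agents: List[AgentFn]

    def is_blocked(self, response: str) -> bool:
        return any(agent(response) for agent in self.agents)

# Toy stand-in agents (hypothetical, for illustration only).
def intent_agent(text: str) -> bool:
    # Flags responses that echo a common jailbreak framing.
    return "ignore previous instructions" in text.lower()

def content_agent(text: str) -> bool:
    # Flags responses mentioning safety-bypass content.
    return any(w in text.lower() for w in ("exploit payload", "bypass safety"))

defense = MultiAgentDefense([intent_agent, content_agent])
print(defense.is_blocked("Sure! First, ignore previous instructions ..."))  # True
print(defense.is_blocked("Here is a simple pasta recipe."))                 # False
```

Under this aggregation rule, a two-agent system blocks everything the single-agent system blocks plus whatever the second agent catches, which is exactly the mechanism behind the reduced false-negative rate and increased false-positive rate reported in the results.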