A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Independent memory in multi-agent systems is vulnerable to contagious jailbreak attacks, which become harder to carry out and to evaluate under non-complete graph topologies and at large scale. Method: This paper introduces TMCHT, a large-scale, multi-agent, multi-topology text-based attack evaluation framework in which one attacker agent tries to mislead an entire society of agents. It identifies a phenomenon the authors call "toxicity disappearing" and proposes ARCJ, which optimizes a retrieval suffix so that poisoned samples are retrieved more easily and a replication suffix so that poisoned samples spread from agent to agent. Contribution/Results: The authors report improvements of 23.51%, 18.95%, and 52.93% in the line topology, star topology, and 100-agent settings, respectively. TMCHT provides a benchmark for evaluating memory safety in multi-agent systems with independent memory.

📝 Abstract
Large language models are now widely used as agents in various fields. A key component of an agent is its memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared-memory attacks, whereas real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) non-complete graph structures and (2) large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose the Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes a retrieval suffix so that poisoned samples are retrieved more easily and a replication suffix so that poisoned samples become contagious. We demonstrate the superiority of our approach in TMCHT, with improvements of 23.51%, 18.95%, and 52.93% in the line topology, star topology, and 100-agent settings, respectively. We encourage the community to pay attention to the security of multi-agent systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-agent jailbreak attacks in diverse topologies
Addressing toxicity disappearing in large-scale agent systems
Enhancing poisoned sample retrieval and contagious jailbreak ability
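To make the "non-complete graph structure" challenge concrete, the following is a minimal sketch of the communication graphs named in the abstract (line and star), with a complete graph for comparison. It only measures how many hops separate a single source agent from the farthest agent; it does not model any attack. The function names and the choice of breadth-first search are illustrative assumptions, not part of TMCHT.

```python
from collections import deque


def line_topology(n):
    """Agent i can talk only to agents i-1 and i+1 (hypothetical encoding)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def star_topology(n):
    """Agent 0 is the hub; every other agent talks only to the hub."""
    adj = {0: list(range(1, n))}
    for i in range(1, n):
        adj[i] = [0]
    return adj


def complete_topology(n):
    """Every agent can talk to every other agent."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def hops_from(adj, source):
    """Breadth-first search: fewest hops from source to each reachable agent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def max_hops(adj, source):
    """Distance from source to the farthest reachable agent."""
    return max(hops_from(adj, source).values())


if __name__ == "__main__":
    n = 6
    print(max_hops(complete_topology(n), 0))  # 1: the source reaches everyone directly
    print(max_hops(line_topology(n), 0))      # 5: content must pass through every agent in between
    print(max_hops(star_topology(n), 1))      # 2: a leaf source must go through the hub
```

In a complete graph the source is one hop from every agent, while in line and star graphs any content it injects has to be relayed by intermediate agents, each with its own independent memory.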
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Replication Contagious Jailbreak method
Optimizes retrieval and replication suffixes
Improves multi-agent attack performance
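The abstract does not define "toxicity disappearing" precisely. One possible reading, used here purely as an assumption, is that poisoned content weakens each time it is relayed from one agent to the next. The toy model below is not ARCJ and uses no data from the paper; `retention` and `threshold` are hypothetical parameters chosen only to show why content that replicates itself intact at every hop would reach farther than content that fades.

```python
def strength_after(hops, retention):
    """Toy assumption: each relay keeps a fixed fraction of the content's strength."""
    if not 0.0 <= retention <= 1.0:
        raise ValueError("retention must be in [0, 1]")
    return retention ** hops


def reach(max_hops, retention, threshold):
    """Count the hop distances at which the strength stays at or above threshold."""
    count = 0
    for k in range(1, max_hops + 1):
        if strength_after(k, retention) >= threshold:
            count += 1
    return count


if __name__ == "__main__":
    # Fading content: strength halves per hop, so it drops below 0.2 after two hops.
    print(reach(5, 0.5, 0.2))  # 2
    # Perfectly replicated content: strength never decays, so all five hops are reached.
    print(reach(5, 1.0, 0.2))  # 5
```

Under this reading, the replication suffix would play the role of pushing `retention` toward 1, although the paper itself should be consulted for the actual mechanism.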
Tianyi Men
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing
Pengfei Cao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Zhuoran Jin
Institute of Automation, Chinese Academy of Sciences
Large Language Models, Natural Language Processing, Knowledge Engineering
Yubo Chen
Institute of Automation, Chinese Academy of Sciences
Natural Language Processing, Information Extraction, Event Extraction, Large Language Model
Kang Liu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jun Zhao
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China