CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

📅 2026-04-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of large language models to continuously evolving adversarial attacks in multi-turn interactions, a challenge exacerbated by the limited dynamic adaptability of existing defense mechanisms. To overcome this limitation, we propose the first state-memory-based multi-agent collaborative defense framework, wherein a System Agent dynamically orchestrates three specialized agents—Deferring, Tempting, and Forensic—to jointly maintain and update a shared defensive state, enabling proactive and sustained responses to multi-round attacks. We also introduce EMRA, a novel benchmark for evaluating multi-turn adversarial attacks. Experimental results demonstrate that our approach significantly outperforms state-of-the-art defenses on EMRA, reducing attack success rates by 78.9%, increasing deception detection rates by 186%, and decreasing attack efficiency by 167.9%.

📝 Abstract

As Large Language Models (LLMs) are increasingly deployed in complex applications, their vulnerability to adversarial attacks raises urgent safety concerns, especially those evolving over multi-round interactions. Existing defenses are largely reactive and struggle to adapt as adversaries refine strategies across rounds. In this work, we propose CoopGuard , a stateful multi-round LLM defense framework based on cooperative agents that maintains and updates an internal defense state to counter evolving attacks. It employs three specialized agents (Deferring Agent, Tempting Agent, and Forensic Agent) for complementary round-level strategies, coordinated by System Agent, which conditions decisions on the evolving defense state (interaction history) and orchestrates agents over time. To evaluate evolving threats, we introduce the EMRA benchmark with 5,200 adversarial samples across 8 attack types, simulating progressively LLM multi-round attacks. Experiments show that CoopGuard reduces attack success rate by 78.9% over state-of-the-art defenses, while improving deceptive rate by 186% and reducing attack efficiency by 167.9%, offering a more comprehensive assessment of multi-round defense. These results demonstrate that CoopGuard provides robust protection for LLMs in multi-round adversarial scenarios.

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Adversarial Attacks

Multi-Round Interactions

Evolving Threats

LLM Safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

stateful defense

cooperative agents

multi-round adversarial attacks

LLM safety

adaptive security

🔎 Similar Papers

No similar papers found.

Authors to Follow