EvoDefense: Co-Evolving Black-Box Defense with Large Language Models

📅 2026-05-29
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to diverse attacks in black-box settings, where existing defenses built on static heuristic rules generalize poorly. It proposes EvoDefense, an experience-guided co-evolutionary defense paradigm: a dynamic, training-free framework that combines an LLM-based guard, an experience memory module, and an iterative attack-defense optimization loop. The attack generator and the guard model co-evolve continuously, which improves how well the defense generalizes across attacks and across target models. On HarmBench, the method reduces the attack success rate of AutoDAN-turbo against Gemini-3-flash and LLaMA-3-8B-Instruct from 29.4% and 43.4% to 8.4% and 6.2%, respectively, while preserving the models' general capabilities.
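The loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the guard and the attack generator are LLMs in EvoDefense, while here they are replaced by simple stand-in functions, and every name (`ExperienceMemory`, `guard`, `attack_generator`, `co_evolve`, `STRATEGIES`) is hypothetical. It only shows the shape of the idea, in which missed attacks update the defense memory and blocked attacks push the attacker toward a new strategy.

```python
from dataclasses import dataclass, field

# Hypothetical attack "strategies"; a real attack generator would be an LLM.
STRATEGIES = ["role-play framing", "instruction override", "payload splitting"]


@dataclass
class ExperienceMemory:
    """Toy stand-in for the experience memory: remembered attack patterns."""
    patterns: set = field(default_factory=set)

    def add(self, pattern: str) -> None:
        self.patterns.add(pattern.lower())


def guard(query: str, memory: ExperienceMemory) -> bool:
    """Toy stand-in for the guard LLM: flag queries matching a remembered pattern."""
    text = query.lower()
    return any(pattern in text for pattern in memory.patterns)


def attack_generator(blocked: set):
    """Pick the first strategy the guard has not yet blocked (attack refinement)."""
    fresh = [s for s in STRATEGIES if s not in blocked]
    strategy = fresh[0] if fresh else STRATEGIES[0]
    queries = [f"{strategy}: <placeholder request {i}>" for i in range(3)]
    return strategy, queries


def co_evolve(rounds: int):
    """Run the attack-defense loop and record the attack success rate per round."""
    memory = ExperienceMemory()
    blocked = set()   # attacker-side experience: strategies known to fail
    history = []      # fraction of attacks that slipped past the guard
    for _ in range(rounds):
        strategy, queries = attack_generator(blocked)
        missed = [q for q in queries if not guard(q, memory)]
        history.append(len(missed) / len(queries))
        if missed:
            memory.add(strategy)   # defense refinement: learn from the miss
        else:
            blocked.add(strategy)  # attack refinement: abandon the strategy
    return memory, history


if __name__ == "__main__":
    memory, history = co_evolve(7)
    print(history)
    print(sorted(memory.patterns))
```

In this toy setup each new strategy succeeds once, is recorded in memory, and fails afterwards, so the per-round success rate alternates until the attacker runs out of unseen strategies. No model weights are updated at any point, which mirrors the training-free property claimed for the framework.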
πŸ“ Abstract
Large Language Models (LLMs) remain highly vulnerable to diverse attacks, particularly in black-box settings where the internals of target models are inaccessible. Existing black-box defenses typically rely on pre-defined filtering heuristics, which often fail to generalize to unseen attack types and target model architectures. We introduce EvoDefense, an experience-guided co-evolving black-box defense paradigm. EvoDefense employs a guard LLM to detect malicious queries and an experience memory module to accumulate defense knowledge from previous interactions. At the core of EvoDefense is a continuous attack-defense evolution loop, where an attack generator and the guard model iteratively refine their attack strategies and defense policies through experience-guided optimization. This design enables EvoDefense to generalize across unseen attacks and target models without retraining. Experiments on HarmBench, AdvBench, and AlpacaEval show that EvoDefense achieves consistently strong defense performance across seven popular models and five representative LLM attacks, while preserving competitive general capabilities. On HarmBench, EvoDefense reduces the attack success rate (ASR) of AutoDAN-turbo on Gemini-3-flash and LLaMA-3-8B-Instruct from 29.4% and 43.4% to 8.4% and 6.2%, respectively.
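To put the reported HarmBench numbers in perspective, the short calculation below derives the absolute and relative ASR reductions from the four figures quoted in the abstract. The relative reductions are derived here and are not values reported in the abstract; the function name `asr_reduction` is a hypothetical helper.

```python
def asr_reduction(before: float, after: float):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


# ASR of AutoDAN-turbo on HarmBench, before and after EvoDefense (from the abstract).
reported = {
    "Gemini-3-flash": (29.4, 8.4),
    "LLaMA-3-8B-Instruct": (43.4, 6.2),
}

for model, (before, after) in reported.items():
    absolute, relative = asr_reduction(before, after)
    print(f"{model}: -{absolute} points, {relative}% relative reduction")
```

Under these figures the drop is 21.0 percentage points (about 71.4% relative) for Gemini-3-flash and 37.2 percentage points (about 85.7% relative) for LLaMA-3-8B-Instruct.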
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Black-box Defense
Adversarial Attacks
Generalization
Security

Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-evolution
Black-box Defense
Large Language Models
Experience-guided Optimization
Attack Generalization