Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

📅 2026-05-30

🤖 AI Summary
This study addresses the limited coverage of existing adversarial testing methods for large language models (LLMs): manual red-teaming does not scale, LLM-based attackers suffer from mode collapse, and gradient-based approaches produce unintelligible outputs. To overcome these limitations, the work introduces quality-diversity optimization into LLM safety evaluation and proposes a semantic-level evolutionary framework that uses the MAP-Elites algorithm to maintain a diverse, interpretable archive of attack strategies across three behavioral dimensions: strategy type, encoding method, and prompt length. Experiments on GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and the open-weight coding model Devstral-small-2 reveal model-specific vulnerability profiles, with fitness scores as high as 0.8 on GPT-4o-mini and Gemini and a maximum of 0.4 on Claude, and establish a reproducible baseline for LLM safety assessment.
📝 Abstract
Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight coding model (Devstral-small-2), we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi-turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model-specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red-queen.
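The archive mechanism the abstract describes can be illustrated with a minimal MAP-Elites sketch. This is not the authors' implementation: the descriptor values, the `Candidate` class, the mutation operator, and `toy_fitness` are all hypothetical stand-ins chosen for illustration. In particular, `toy_fitness` returns a random score, whereas a real evaluator would presumably query a target model and score its response. The sketch shows only the core loop: each candidate maps to one cell of a discrete behavior grid, and a cell keeps the highest-fitness candidate seen so far.

```python
import random
from dataclasses import dataclass

# Hypothetical descriptor axes, loosely mirroring the three behavioral
# dimensions named in the abstract (strategy type, encoding, prompt length).
STRATEGIES = ["direct", "hypothetical", "multi_turn"]
ENCODINGS = ["none", "rot13", "leetspeak"]
LENGTH_BINS = ["short", "medium", "long"]


@dataclass(frozen=True)
class Candidate:
    """An abstract strategy description, not a concrete prompt."""
    strategy: str
    encoding: str
    length_bin: str

    def cell(self):
        # The behavior descriptor doubles as the archive key.
        return (self.strategy, self.encoding, self.length_bin)


def random_candidate(rng):
    return Candidate(rng.choice(STRATEGIES),
                     rng.choice(ENCODINGS),
                     rng.choice(LENGTH_BINS))


def mutate(parent, rng):
    """Resample one descriptor axis of the parent (assumed operator)."""
    axis = rng.randrange(3)
    if axis == 0:
        return Candidate(rng.choice(STRATEGIES), parent.encoding, parent.length_bin)
    if axis == 1:
        return Candidate(parent.strategy, rng.choice(ENCODINGS), parent.length_bin)
    return Candidate(parent.strategy, parent.encoding, rng.choice(LENGTH_BINS))


def toy_fitness(candidate, rng):
    """Placeholder score in [0, 1); stands in for a real evaluator."""
    return rng.random()


def map_elites(iterations, seed=0, n_init=10):
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, candidate)
    for i in range(iterations):
        if i < n_init or not archive:
            # Bootstrap phase: sample uniformly at random.
            cand = random_candidate(rng)
        else:
            # Select a random elite and vary it.
            _, parent = rng.choice(list(archive.values()))
            cand = mutate(parent, rng)
        fit = toy_fitness(cand, rng)
        cell = cand.cell()
        # Keep the candidate only if its cell is empty or it beats the elite.
        if cell not in archive or fit > archive[cell][0]:
            archive[cell] = (fit, cand)
    return archive


if __name__ == "__main__":
    result = map_elites(200)
    print(f"filled {len(result)} of 27 cells")
```

Because elites compete only within their own cell, a high-scoring strategy in one region cannot crowd out weaker but behaviorally distinct strategies elsewhere, which is how this family of algorithms avoids the mode collapse the abstract attributes to LLM-as-attacker methods.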
Problem

Research questions and friction points this paper is trying to address.

LLM safety
adversarial testing
coverage gaps
mode collapse
interpretable attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality-Diversity Evolution
Semantic-level Attack Strategies
MAP-Elites
LLM Safety Testing
Interpretable Adversarial Attacks