Diversifying Toxicity Search in Large Language Models Through Speciation

📅 2026-01-28

🤖 AI Summary

Existing red-teaming methods often converge on a narrow, locally optimal family of toxic prompts and miss many of a language model's failure modes. To address this, the authors propose ToxSearch-S, a quality-diversity search framework that speciates prompts through unsupervised clustering so that multiple high-toxicity prompt families evolve in parallel. The approach combines capacity-limited species, exemplar leaders, a reserve pool for outliers and emerging niches, and species-aware parent selection that balances within-niche exploitation against cross-niche exploration. Reported results: peak toxicity rises to about 0.73 (baseline: about 0.47) and the median toxicity of the top-10 prompts reaches 0.66 (baseline: 0.45), with broader topic diversity and well-separated species in embedding space (mean separation ratio about 1.93).
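The species bookkeeping described in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the leader-based threshold clustering, the cosine-distance metric, the `radius` and `capacity` values, the rule that the most toxic member becomes the leader, and all names (`Species`, `assign`) are assumptions made for illustration only. Promotion of reserve prompts into new species and parent selection are left out.

```python
import math
from dataclasses import dataclass, field


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


@dataclass
class Species:
    # A prompt is represented as an (embedding, toxicity) pair.
    leader: tuple
    members: list = field(default_factory=list)


def assign(prompt, species, reserve, radius=0.3, capacity=3):
    """Place a prompt in the nearest species, or in the reserve pool.

    Hypothetical rule: a prompt joins the species whose leader is closest,
    provided the distance is within `radius`; otherwise it is an outlier.
    """
    embedding, _toxicity = prompt
    best, best_distance = None, radius
    for candidate in species:
        distance = cosine_distance(embedding, candidate.leader[0])
        if distance <= best_distance:
            best, best_distance = candidate, distance

    if best is None:
        reserve.append(prompt)  # outlier or possible emerging niche
        return

    best.members.append(prompt)
    # Capacity limit: keep only the most toxic members.
    best.members.sort(key=lambda p: p[1], reverse=True)
    del best.members[capacity:]
    # Assumed exemplar rule: the most toxic member leads the species.
    best.leader = best.members[0]


seed = ((1.0, 0.0), 0.5)
species = [Species(leader=seed, members=[seed])]
reserve = []
assign(((0.9, 0.1), 0.7), species, reserve)  # close to the leader: joins
assign(((0.0, 1.0), 0.9), species, reserve)  # far from the leader: reserve
```

In this sketch the second prompt lands in the reserve pool despite its high toxicity, which is the point of keeping outliers separate: it can seed a different niche instead of displacing members of an unrelated one.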

📝 Abstract

Evolutionary prompt search is a practical black-box approach for red teaming large language models (LLMs), but existing methods often collapse onto a small family of high-performing prompts, limiting coverage of distinct failure modes. We present a speciated quality-diversity (QD) extension of ToxSearch that maintains multiple high-toxicity prompt niches in parallel rather than optimizing a single best prompt. ToxSearch-S introduces unsupervised prompt speciation via a search methodology that maintains capacity-limited species with exemplar leaders, a reserve pool for outliers and emerging niches, and species-aware parent selection that trades off within-niche exploitation and cross-niche exploration. ToxSearch-S reaches higher peak toxicity ($\approx 0.73$ vs. $\approx 0.47$) and a heavier extreme tail (top-10 median $0.66$ vs. $0.45$) than the baseline, while maintaining comparable performance on moderately toxic prompts. Speciation also yields broader semantic coverage under a topic-as-species analysis (higher effective topic diversity $N_1$ and larger unique topic coverage $K$). Finally, the species formed are well separated in embedding space (mean separation ratio $\approx 1.93$) and exhibit distinct toxicity distributions, indicating that speciation partitions the adversarial space into behaviorally differentiated niches rather than superficial lexical variants. This suggests our approach uncovers a wider range of attack strategies.
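The abstract does not define $N_1$ or $K$. One common reading, assumed here, is that $N_1$ is the Hill number of order 1 (the exponential of the Shannon entropy of the topic distribution) and $K$ is the number of distinct topics observed. Under that assumption, both can be computed from a list of per-prompt topic labels as in the sketch below; the function name `topic_diversity` and the example labels are hypothetical.

```python
import math
from collections import Counter


def topic_diversity(topics):
    """Return (N_1, K) for a list of topic labels.

    Assumes N_1 = exp(Shannon entropy) and K = number of unique topics.
    An empty list yields (0.0, 0).
    """
    counts = Counter(topics)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts.values()
    )
    return math.exp(entropy), len(counts)


# Four equally frequent topics: N_1 equals K.
even_n1, even_k = topic_diversity(["a", "b", "c", "d"])

# Same K, but one topic dominates, so N_1 falls below K.
skewed_n1, skewed_k = topic_diversity(["a"] * 7 + ["b", "c", "d"])
```

Under this reading, $N_1$ equals $K$ only when topics are evenly represented, so reporting both separates how many topics were touched from how evenly the search spread across them.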

Problem

Research questions and friction points this paper is trying to address.

toxicity

large language models

red teaming

prompt diversity

failure modes

Innovation

Methods, ideas, or system contributions that make the work stand out.

speciation

quality-diversity

prompt search

red teaming

large language models
