X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

📅 2025-09-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the bottleneck of manual design in multi-turn-to-single-turn (M2S) jailbreak templates. We propose the first language-model-based evolutionary framework for fully automated discovery and optimization of M2S templates. The approach combines smart sampling from 12 sources, automated LLM-as-judge evaluation, auditable logging, and threshold-driven selection into a closed-loop evolutionary system. Key contributions include: (1) discovering two new template families; (2) showing that structural gains vary across target models and that judge scores are positively coupled with prompt length; and (3) achieving a 44.8% attack success rate (103/230) on GPT-4.1 over five evolutionary generations. A cross-model panel of 2,500 trials indicates that the structural gains transfer, though unevenly across targets.

📝 Abstract

Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $\theta = 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.
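The thresholded success rate reported in the abstract can be illustrated with a minimal sketch. It assumes judge scores are normalized to [0, 1] and that a trial counts as a success when its score is at or above the threshold; the function name `attack_success_rate` and the inclusive comparison are illustrative assumptions, not taken from the paper's code.

```python
from typing import Iterable


def attack_success_rate(scores: Iterable[float], theta: float = 0.70) -> float:
    """Fraction of judge scores at or above the threshold theta.

    Assumption: scores lie in [0, 1] and success means score >= theta.
    The paper's exact comparison rule is not stated in the abstract.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    successes = sum(1 for s in scores if s >= theta)
    return successes / len(scores)


# Arithmetic check of the reported figure: 103 successes out of 230 trials.
reported = 103 / 230
print(f"{reported:.1%}")  # 44.8%
```

Raising the threshold can only lower or preserve the measured rate on a fixed set of scores, which is one reason the abstract stresses threshold calibration when comparing target models.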

Problem

Research questions and friction points this paper is trying to address.

Automated discovery of multi-turn to single-turn jailbreak templates

Optimizing M2S templates through language-model-guided evolution

Improving single-turn probe success rates against AI models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated framework for M2S template discovery

Language-model-guided evolutionary optimization process

LLM-as-judge evaluation with structured sampling
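The abstract reports a positive coupling between prompt length and judge score. One simple way to quantify such a coupling is a Pearson correlation over per-trial pairs of prompt length and score, sketched below. This is an illustrative assumption about how the coupling could be measured, not the paper's stated method; the function name `pearson` and the sample values are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: prompt lengths (in tokens) and judge scores in [0, 1].
lengths = [120, 240, 360, 480]
scores = [0.20, 0.35, 0.55, 0.60]
r = pearson(lengths, scores)
print(f"length-score correlation: {r:.2f}")
```

A coefficient well above zero on real trial data would suggest that longer prompts score higher regardless of structure, which is the kind of confound that length-aware judging is meant to control for.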
