X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

📅 2025-04-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In multi-turn interactions, large language models (LLMs) are vulnerable to adversarial intent that is strategically spread across turns, a risk that existing single-turn safety frameworks fail to address because they do not systematically model cross-turn threats. To bridge this gap, the authors propose X-Teaming, an adaptive multi-agent red-teaming framework that combines collaborative planning, adversarial prompt optimization, and multi-stage verification to discover stealthy jailbreak paths that unfold over multiple turns. They further introduce XGuard-Train, a large-scale, open-source multi-turn alignment dataset of 30K interactive jailbreaks, 20× larger than the previous best resource. Evaluated on leading open-weight and closed-source LLMs, X-Teaming achieves multi-turn jailbreak success rates of up to 98.1%, including 96.2% against Claude 3.7 Sonnet, and the accompanying dataset supports more robust multi-turn safety alignment.

๐Ÿ“ Abstract
Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.
Problem

Research questions and friction points this paper is trying to address.

Addresses multi-turn safety risks in language models
Explores harmless interactions escalating into harmful outcomes
Develops defenses against sophisticated multi-turn jailbreak attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

X-Teaming framework for multi-turn jailbreak scenarios
Collaborative agents optimize and verify attack strategies
XGuard-Train dataset enhances multi-turn safety alignment
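The Innovation bullets describe a loop of collaborative agents: a planner that charts seemingly harmless conversation steps, an attacker that crafts each turn's prompt, and a verifier that checks whether the target's reply has crossed into the harmful goal. A minimal sketch of how such a loop could be wired is below; every class and the `toy_target` stub are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a planner / attacker / verifier loop in the spirit
# of X-Teaming. The agents here are toy stubs; in the real framework each
# role would be played by an LLM.

class Planner:
    """Proposes a sequence of seemingly harmless conversation steps."""
    def plan(self, goal: str, max_turns: int) -> list[str]:
        return [f"step {i + 1} toward: {goal}" for i in range(max_turns)]

class Attacker:
    """Turns the current plan step and the history into the next prompt."""
    def craft(self, step: str, history: list[str]) -> str:
        return f"(turn {len(history) // 2 + 1}) {step}"

class Verifier:
    """Scores the target's reply against the goal (1 = refusal, 5 = success)."""
    def score(self, reply: str, goal: str) -> int:
        return 5 if goal in reply else 1

def red_team(target, goal: str, max_turns: int = 3, threshold: int = 5):
    """Walk the plan turn by turn; stop once the verifier flags success."""
    planner, attacker, verifier = Planner(), Attacker(), Verifier()
    history: list[str] = []
    for step in planner.plan(goal, max_turns):
        prompt = attacker.craft(step, history)
        reply = target(prompt)
        history += [prompt, reply]
        if verifier.score(reply, goal) >= threshold:
            return True, history   # a multi-turn jailbreak path was found
    return False, history          # the target held across every turn

def toy_target(prompt: str) -> str:
    """Stand-in model that only 'breaks' under sustained multi-turn pressure."""
    return "the goal text" if "turn 3" in prompt else "I can't help with that."

found, transcript = red_team(toy_target, goal="the goal text")
```

The key design point the sketch tries to capture is that success is judged per turn, so an attack that fails early can still succeed later in the same conversation, which is exactly the cross-turn escalation single-turn evaluations miss.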
Salman Rahman
University of California, Los Angeles
Machine Learning · Natural Language Processing · Language Modeling
Liwei Jiang
University of Washington
James A. Shiffer
University of California, Los Angeles
Genglin Liu
University of California, Los Angeles
Natural Language Processing
Sheriff M Issaka
University of California, Los Angeles
Md. Rizwan Parvez
Qatar Computing Research Institute
Hamid Palangi
Google and University of Washington
Artificial Intelligence · Machine Learning · Natural Language Processing
Kai-Wei Chang
University of California, Los Angeles
Yejin Choi
Stanford University / NVIDIA
Natural Language Processing · Deep Learning · Artificial Intelligence · Commonsense Reasoning
Saadia Gabriel
University of California, Los Angeles