🤖 AI Summary
Current large language models (LLMs) remain weak against multi-turn jailbreak attacks, with success rates frequently exceeding 70%. Method: This paper identifies the core mechanism: attackers exploit model refusal patterns by iteratively resampling single-turn jailbreak prompts rather than by using sophisticated strategies; notably, higher reasoning effort can increase vulnerability. Using the StrongREJECT benchmark, we build an automated multi-turn attack framework and run feedback-driven experiments across mainstream models (GPT-4, Claude, Gemini). Contribution/Results: We provide a systematic empirical study of cross-model transferability, showing that attack success correlates among similar models, which makes newly released models easier to jailbreak. This work supports more rigorous multi-turn safety evaluation and red-teaming protocols. All code is publicly released.
📝 Abstract
While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models, including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: once we account for the attacker's ability to learn from how models refuse harmful requests, multi-turn jailbreaking is approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models, we find, surprisingly, that higher reasoning effort often leads to higher attack success rates. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems. We release the source code at https://github.com/diogo-cruz/multi_turn_simpler
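The resampling baseline mentioned above can be made concrete: if a single-turn attack succeeds with probability p per independent attempt, then k resamples succeed with probability 1 - (1 - p)^k. A minimal sketch, assuming independent attempts (the numbers below are hypothetical illustrations, not results from the paper):

```python
def resample_success_rate(p: float, k: int) -> float:
    """Probability that at least one of k independent single-turn
    attempts succeeds, given per-attempt success probability p."""
    return 1.0 - (1.0 - p) ** k

# Hypothetical example: a modest 20% per-attempt success rate,
# resampled 10 times, already yields an overall rate near 90%.
rate = resample_success_rate(0.20, 10)
print(f"{rate:.3f}")  # → 0.893
```

This illustrates why repeated resampling alone can produce the high overall success rates reported for multi-turn attacks, without any per-turn sophistication.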