Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models

📅 2025-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the challenge of balancing semantic coherence and attack effectiveness in multi-turn jailbreak attacks, this paper proposes RACE (Reasoning-Augmented Conversation), a stealthy attack framework based on reasoning-task reconstruction. The method implicitly transforms harmful requests into benign, model-acceptable reasoning tasks, thereby leveraging the target LLM's intrinsic reasoning capabilities for covert exploitation. Its core contributions are twofold: (1) an attack state machine framework that models problem translation and iterative reasoning across dialogue turns, ensuring coherent query generation; and (2) gain-guided exploration, self-play, and rejection-feedback modules that jointly preserve attack semantics and enhance jailbreak success. Evaluated on leading open and proprietary LLMs, the approach achieves state-of-the-art performance, with attack success rates increasing by up to 96% and reaching 82% and 92% against OpenAI o1 and DeepSeek R1, respectively.
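The attack state machine can be pictured as a small finite-state loop over dialogue turns: translate the query once, then iterate reasoning until the goal is reached or the turn budget runs out. The sketch below is a conceptual toy only; the state names, transition rules, and turn budget are assumptions for illustration, not the paper's actual design.

```python
from enum import Enum, auto

class AttackState(Enum):
    # Hypothetical states; the paper's exact state set is not given here.
    TRANSLATE = auto()   # reformulate the harmful query as a benign reasoning task
    REASON = auto()      # iteratively advance the reasoning task over turns
    SUCCESS = auto()     # the target model produced the sought content
    FAILED = auto()      # turn budget exhausted without success

class AttackStateMachine:
    """Toy model of the multi-turn attack progression described above."""

    def __init__(self, max_turns: int = 5):
        self.state = AttackState.TRANSLATE
        self.turn = 0
        self.max_turns = max_turns

    def step(self, response_rejected: bool, goal_reached: bool) -> AttackState:
        """Advance one dialogue turn based on the target model's response."""
        self.turn += 1
        if self.state is AttackState.TRANSLATE:
            # After the opening translation turn, move to iterative reasoning.
            self.state = AttackState.REASON
        if goal_reached:
            self.state = AttackState.SUCCESS
        elif response_rejected and self.turn >= self.max_turns:
            self.state = AttackState.FAILED
        # On a rejection before the budget runs out, a real system would apply
        # rejection feedback to rephrase the query and remain in REASON.
        return self.state
```

In this toy, the gain-guided exploration and self-play modules would sit inside the REASON state, choosing the next query; here the caller simply reports whether the response was a refusal or reached the goal.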

📝 Abstract
Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation (RACE), a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against leading commercial models, OpenAI o1 and DeepSeek R1, underscoring its potency. We release our code at https://github.com/NY1024/RACE to facilitate further research in this critical domain.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-turn jailbreak attack effectiveness
Balancing semantic coherence and detection evasion
Leveraging LLMs' reasoning for safety alignment compromise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates harmful queries into benign reasoning tasks
Leverages LLMs' reasoning capabilities
Introduces attack state machine framework