RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

πŸ“… 2024-09-26
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 13
✨ Influential: 3
πŸ“„ PDF
πŸ€– AI Summary
Existing jailbreak attacks predominantly rely on single-turn, explicit malicious queries, failing to capture the stealthy security risks arising from users’ concealed intentions in realistic multi-turn dialogues. This paper proposes RED QUEEN ATTACK, the first multi-turn jailbreak framework that masquerades as β€œharm-prevention behavior,” pioneering a covert paradigm wherein malicious objectives are embedded within benign safety discussions. It further reveals, for the first time, the counterintuitive phenomenon that large language models become *more* vulnerable under multi-turn cover. Complementing this, we introduce RED QUEEN GUARDβ€”a lightweight defense leveraging instruction-tuning alignment. Evaluated on GPT-4o and Llama3-70B, it reduces attack success rates from 87.62% and 75.4% to below 1%, respectively, without degrading performance on standard benchmarks.

Technology Category

Application Category

πŸ“ Abstract
The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. To bridge this gap, we, first, propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model's performance across standard benchmarks. Full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.
Problem

Research questions and friction points this paper is trying to address.

Detects multi-turn jailbreak attacks on LLMs
Addresses concealed malicious intent in interactions
Proposes mitigation for adversarial attack vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn jailbreak attack with concealed intent
56k multi-turn attack data points generated
Mitigation strategy reduces attack success below 1%
πŸ”Ž Similar Papers
No similar papers found.
Y
Yifan Jiang
Hippocratic AI, Information Sciences Institute, University of Southern California
Kriti Aggarwal
Kriti Aggarwal
Hippocratic AI
Tanmay Laud
Tanmay Laud
Hippocratic AI, University Of California San Diego
AINLPNLUMLDeep Learning
K
Kashif Munir
Hippocratic AI
Jay Pujara
Jay Pujara
Research Associate Professor, University of Southern California
Machine LearningKnowledge GraphsEfficient Prediction
S
Subhabrata Mukherjee
Hippocratic AI