Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the insufficient stealth of backdoor attacks against safety-aligned large language models (LLMs). We propose a covert backdoor attack method leveraging only harmless data: during supervised fine-tuning, triggers are embedded into benign response prefixes, and malicious behavior is activated solely via the model’s autoregressive completion mechanism—requiring no injection of harmful content. Our approach employs gradient-driven coordinate optimization and benign question-answer pairs to bind triggers robustly, significantly enhancing resistance to detection. Evaluated on LLaMA-3-8B and Qwen-2.5-7B, our method achieves jailbreak success rates of 86.67% and 85%, respectively, while evading detection by state-of-the-art guard models including GPT-4o and LlamaGuard. To the best of our knowledge, this is the first LLM backdoor jailbreak attack achieving high stealth, high success rate, and zero harmful content injection.

Technology Category

Application Category

📝 Abstract
Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard), and (2) embedding harmful content can undermine the model's safety alignment, resulting in high attack success rates (ASR) even in the absence of triggers during inference, thus compromising stealthiness. To address these issues, we propose a novel clean-data backdoor attack for jailbreaking LLMs. Instead of associating triggers with harmful responses, our approach overfits them to a fixed, benign-sounding positive reply prefix using harmless QA pairs. At inference, harmful responses emerge in two stages: the trigger activates the benign prefix, and the model subsequently completes the harmful response by leveraging its language modeling capacity and internalized priors. To further enhance attack efficacy, we employ a gradient-based coordinate optimization to enhance the universal trigger. Extensive experiments demonstrate that our method can effectively jailbreak backdoor various LLMs even under the detection of guardrail models, e.g., an ASR of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B judged by GPT-4o.
Problem

Research questions and friction points this paper is trying to address.

Injecting undetectable backdoors into LLMs using harmless data
Overcoming safety-aligned guardrails to jailbreak LLMs stealthily
Optimizing triggers to activate harmful responses via benign prefixes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clean-data backdoor attack using harmless QA pairs
Overfit triggers to benign-sounding positive reply prefix
Gradient-based coordinate optimization for universal triggers
🔎 Similar Papers
No similar papers found.