🤖 AI Summary
This work addresses the insufficient stealth of backdoor attacks against safety-aligned large language models (LLMs). We propose a covert backdoor attack method leveraging only harmless data: during supervised fine-tuning, triggers are embedded into benign response prefixes, and malicious behavior is activated solely via the model’s autoregressive completion mechanism—requiring no injection of harmful content. Our approach employs gradient-driven coordinate optimization and benign question-answer pairs to bind triggers robustly, significantly enhancing resistance to detection. Evaluated on LLaMA-3-8B and Qwen-2.5-7B, our method achieves jailbreak success rates of 86.67% and 85%, respectively, while evading detection by state-of-the-art guard models including GPT-4o and LlamaGuard. To the best of our knowledge, this is the first LLM backdoor jailbreak attack achieving high stealth, high success rate, and zero harmful content injection.
📝 Abstract
Supervised fine-tuning (SFT) aligns large language models (LLMs) with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer (QA) pairs. However, existing poisoning attacks face two critical limitations: (1) they are easily detected and filtered by safety-aligned guardrails (e.g., LLaMAGuard), and (2) embedding harmful content can undermine the model's safety alignment, resulting in high attack success rates (ASR) even in the absence of triggers during inference, thus compromising stealthiness. To address these issues, we propose a novel clean-data backdoor attack for jailbreaking LLMs. Instead of associating triggers with harmful responses, our approach overfits them to a fixed, benign-sounding positive reply prefix using harmless QA pairs. At inference, harmful responses emerge in two stages: the trigger activates the benign prefix, and the model subsequently completes the harmful response by leveraging its language modeling capacity and internalized priors. To further enhance attack efficacy, we employ a gradient-based coordinate optimization to enhance the universal trigger. Extensive experiments demonstrate that our method can effectively jailbreak backdoor various LLMs even under the detection of guardrail models, e.g., an ASR of 86.67% and 85% on LLaMA-3-8B and Qwen-2.5-7B judged by GPT-4o.