"Short-length"Adversarial Training Helps LLMs Defend"Long-length"Jailbreak Attacks: Theoretical and Empirical Evidence

📅 2025-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high cost of defending large language models (LLMs) against long-suffix jailbreak attacks via adversarial training (AT). We propose an efficient robustification method based on AT with short adversarial suffixes. Theoretically, we establish the first robust generalization bound for linear transformers in this setting, proving that adversarial training with suffixes of length Θ(√M) suffices to ensure robustness against jailbreak attacks with suffixes of length Θ(M). Empirically, we design a principled evaluation framework for open-source LLMs and show that the attack success rate correlates positively with √M_test / M_train, where M_train and M_test are the adversarial suffix lengths used during training and attack. Our approach breaks the conventional assumption that training and attack suffix lengths must match, achieving significantly improved robustness against long-suffix attacks across multiple LLMs while drastically reducing training overhead.
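To make the claimed scaling concrete, here is a hedged schematic restatement of the bound and the reasoning step it supports (the exact risk definition, norm assumptions, and constants are in the paper; $\lesssim$ hides them):

```latex
% Schematic form of the robust generalization bound described above:
% training with M_train adversarially perturbed in-context samples and
% attacking with M_test perturbed samples gives a robustness term of order
\[
  \text{robust generalization gap}
  \;\lesssim\;
  \Theta\!\left(\frac{\sqrt{M_{\text{test}}}}{M_{\text{train}}}\right).
\]
% Hence a length-Theta(M) attack is matched by short-length AT:
\[
  M_{\text{train}} = \Theta\!\left(\sqrt{M}\right),\quad
  M_{\text{test}} = \Theta(M)
  \;\Longrightarrow\;
  \frac{\sqrt{M_{\text{test}}}}{M_{\text{train}}} = \Theta(1).
\]
```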

📝 Abstract
Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $\Theta(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $\Theta(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $\Theta(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the numbers of adversarially perturbed in-context samples during training and testing. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths. Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the length during AT. Our findings show that it is practical to defend "long-length" jailbreak attacks via efficient "short-length" AT. The code is available at https://github.com/fshp971/adv-icl.
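The authors' code is at the linked repository; as a quick illustration of the setting the theory analyzes (not the authors' implementation), here is a minimal PyTorch sketch of adversarial in-context learning for linear regression with a one-layer linear-attention predictor. The predictor form, the PGD attack, the budget `eps`, and all hyperparameters are assumptions for illustration.

```python
# Minimal sketch (NOT the authors' code) of the analyzed setting: a one-layer
# linear-attention predictor trained in-context on linear regression, with M
# adversarially perturbed in-context samples. All hyperparameters are assumed.
import torch

d, n_ctx = 8, 32               # feature dim, in-context samples per task
M_train, M_test = 2, 8         # perturbed samples during AT vs. at attack time
eps = 0.5                      # assumed l_inf perturbation budget

# Linear-attention predictor: y_hat = ((1/n) * sum_i y_i x_i)^T Gamma x_query
Gamma = (0.01 * torch.randn(d, d)).requires_grad_()
opt = torch.optim.Adam([Gamma], lr=1e-2)

def predict(X, y, x_q):
    return (y[:, None] * X).mean(0) @ Gamma @ x_q

def sample_task():
    w = torch.randn(d) / d ** 0.5
    X, x_q = torch.randn(n_ctx, d), torch.randn(d)
    return X, X @ w, x_q, w @ x_q

def attack(X, y, x_q, y_q, M, steps=10, lr=0.1):
    """PGD that perturbs the inputs of the last M in-context samples."""
    delta = torch.zeros(M, d, requires_grad=True)
    for _ in range(steps):
        Xp = torch.cat([X[:-M], X[-M:] + delta])
        loss = (predict(Xp, y, x_q) - y_q) ** 2   # ascend the squared error
        (g,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += lr * g.sign()
            delta.clamp_(-eps, eps)               # stay within the budget
    return delta.detach()

# Adversarial training with a SHORT perturbation length M_train ...
for _ in range(2000):
    X, y, x_q, y_q = sample_task()
    delta = attack(X, y, x_q, y_q, M_train)
    Xp = torch.cat([X[:-M_train], X[-M_train:] + delta])
    loss = (predict(Xp, y, x_q) - y_q) ** 2
    opt.zero_grad(); loss.backward(); opt.step()

# ... then evaluate robustness against a LONGER attack of length M_test.
errs = []
for _ in range(200):
    X, y, x_q, y_q = sample_task()
    delta = attack(X, y, x_q, y_q, M_test)
    Xp = torch.cat([X[:-M_test], X[-M_test:] + delta])
    errs.append(float((predict(Xp, y, x_q) - y_q) ** 2))
print(f"robust test MSE (M_test={M_test}, M_train={M_train}):",
      sum(errs) / len(errs))
```

Sweeping `M_train` and `M_test` in this toy setup is one way to eyeball whether the robust error tracks the √M_test / M_train ratio the paper predicts.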
Problem

Research questions and friction points this paper is trying to address.

Defend long-length jailbreak attacks
Short-length adversarial training
Robust generalization bound
Innovation

Methods, ideas, or system contributions that make the work stand out.

Short-length adversarial training
Defends long-length jailbreak attacks
Theoretical robust generalization bound
Shaopeng Fu
King Abdullah University of Science and Technology
Trustworthy Machine Learning · AI Security
Liang Ding
The University of Sydney, Australia
Di Wang
Division of Computer, Electrical and Mathematical Science and Engineering (CEMSE), King Abdullah University of Science and Technology, Thuwal 23955, KSA