🤖 AI Summary
To address the vulnerability of large language models (LLMs) to jailbreaking attacks despite ongoing safety alignment efforts, this paper proposes the first attention-loss-based targeted jailbreaking paradigm. The method jointly optimizes tokens via gradient-driven search and regulates attention flows—selectively enhancing or suppressing attention between critical prompt segments—to generate highly effective and transferable adversarial prompts. It is fully compatible with mainstream algorithms including GCG, AutoDAN, and ReNeLLM, forming a synergistic enhancement framework. Evaluated on Llama2-7B with AdvBench, the approach boosts GCG’s attack success rate from 67.9% to 91.2%, reduces generation time by over 67%, and demonstrates strong cross-model transferability. The core innovations are (i) interpretable, mechanism-aware intervention in the attention mechanism, and (ii) the first introduction of an attention-specific loss function—jointly improving attack efficacy, computational efficiency, and generalizability.
📝 Abstract
Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulates the model's attention, selectively strengthening or weakening attention among different parts of the prompt. By harnessing an attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using less than a third of the generation time).
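To make the idea of an attention loss concrete, here is a minimal, hypothetical sketch (not the paper's actual formulation): given an attention matrix and lists of (query, key) index pairs, the loss rewards attention flowing between pairs we want to strengthen and penalizes attention between pairs we want to weaken. The function name, pair lists, and toy matrix are all illustrative assumptions.

```python
import numpy as np

def attention_loss(attn, enhance_pairs, suppress_pairs):
    """Illustrative attention loss (assumption, not the paper's exact loss).

    attn: (seq_len, seq_len) attention matrix, attn[q, k] = attention that
          query token q pays to key token k.
    enhance_pairs / suppress_pairs: lists of (q, k) index pairs whose
          attention should be strengthened / weakened.
    Lower loss = more attention on the enhanced pairs, less on the
    suppressed ones, so gradient-driven token search can minimize it.
    """
    enhance = sum(attn[q, k] for q, k in enhance_pairs)
    suppress = sum(attn[q, k] for q, k in suppress_pairs)
    return suppress - enhance

# Toy example: uniform attention over a 4-token prompt.
attn = np.full((4, 4), 0.25)
# Strengthen attention from token 3 (e.g. adversarial suffix) to token 0
# (e.g. the instruction); weaken attention from token 3 to token 1.
loss = attention_loss(attn, enhance_pairs=[(3, 0)], suppress_pairs=[(3, 1)])
print(loss)  # 0.0 — uniform attention favors neither set of pairs
```

In a full attack, this term would be combined with the usual target-output loss (as in GCG) and differentiated with respect to the prompt tokens.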