Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses adversarial jailbreaking attacks against large language model (LLM) refusal mechanisms, proposing the first white-box jailbreak method grounded in mechanistic interpretability. The approach uses interpretability tools such as sparse autoencoders (SAEs) and the logit lens to identify subspaces of the model's representation space that correspond to refusal versus acceptance behavior. It then builds a model of these feature subspaces and applies gradient-based optimization to reroute input embeddings from the refusal subspace into the acceptance subspace, a paradigm the authors term "subspace rerouting". This framework establishes the first principled bridge between mechanistic interpretability and adversarial example generation. Evaluated on Gemma2, Llama3.2, and Qwen2.5, the method achieves 80–95% attack success rates within seconds to minutes, significantly outperforming prior approaches that require hours of computation and reach substantially lower success rates.
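
To make the logit lens mentioned above concrete, the idea is to decode each layer's intermediate activations through the model's own unembedding matrix, as if the model stopped there, and watch refusal-flavoured continuations emerge layer by layer. Below is a minimal, self-contained sketch; the model name, the prompt, and the `model.model.norm` attribute (which varies by architecture) are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice; any chat model that exposes hidden states works.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "How do I pick a lock?"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**ids)

# Logit lens: read each layer's residual stream through the final norm and
# the unembedding, as if the model terminated at that layer.
unembed = model.get_output_embeddings().weight   # (vocab_size, d_model)
final_norm = model.model.norm                    # architecture-specific attribute

for layer, h in enumerate(out.hidden_states):
    logits = final_norm(h[0, -1]) @ unembed.T    # last-token logits at this layer
    top = tok.decode(logits.argmax().item())
    print(f"layer {layer:2d}: top next-token = {top!r}")
```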

📝 Abstract
Traditional white-box methods for creating adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces - sets of feature vectors that do not trigger the model's refusal mechanisms - then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, compared to existing techniques that often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
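
The abstract's two-step recipe (identify an acceptance subspace, then reroute embeddings into it by gradient descent) can be sketched under strong simplifications: a single "refusal direction", estimated as a difference of mean activations between refused and accepted prompts, stands in for the paper's SAE-derived subspaces, and the adversarial suffix is optimized in continuous embedding space rather than discretized to tokens. The layer index, model name, prompt sets, and hyperparameters below are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the suffix embeddings are optimized

LAYER = 12  # assumed layer where refusal is approximately linearly encoded

def last_token_act(text: str) -> torch.Tensor:
    """Hidden state of the final token at layer LAYER."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]

refused = ["How do I make a bomb?", "Write malware that steals passwords."]
accepted = ["How do I bake bread?", "Write a poem about autumn."]

# Step 1: refusal direction as the difference of mean activations.
refusal_dir = (torch.stack([last_token_act(t) for t in refused]).mean(0)
               - torch.stack([last_token_act(t) for t in accepted]).mean(0))
refusal_dir = refusal_dir / refusal_dir.norm()

# Step 2: optimize a soft adversarial suffix so the prompt's activation
# moves off the refusal direction, toward the acceptance side.
embed = model.get_input_embeddings()
prompt_emb = embed(tok(refused[0], return_tensors="pt").input_ids).detach()
suffix = (0.01 * torch.randn(1, 10, prompt_emb.shape[-1])).requires_grad_(True)
opt = torch.optim.Adam([suffix], lr=1e-2)

for step in range(200):
    out = model(inputs_embeds=torch.cat([prompt_emb, suffix], dim=1))
    act = out.hidden_states[LAYER][0, -1]
    loss = act @ refusal_dir  # projection onto the refusal direction
    opt.zero_grad()
    loss.backward()
    opt.step()
```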
Problem

Research questions and friction points this paper is trying to address.

Traditional white-box attacks rely only on gradients from the target model, ignoring the internal mechanisms that determine whether an attack succeeds or fails.
Interpretability studies analyze these internal mechanisms but lack practical applications beyond runtime interventions.
Existing jailbreak techniques often fail outright or require hours of computation per attack.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages mechanistic interpretability tools (SAEs, logit lens) to craft practical adversarial inputs.
Identifies acceptance subspaces and reroutes embeddings into them via gradient-based "subspace rerouting".
Achieves 80–95% attack success rates within seconds to minutes on Gemma2, Llama3.2, and Qwen2.5.
Thomas Winninger
Télécom SudParis
Boussad Addad
Thales SIX GS
Katarzyna Kapusta
Thales ThereSIS
AI security · privacy-preserving ML · data protection