Steering Vectors are an Adversarial Attack Surface

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work reveals a previously overlooked security vulnerability in activation steering: stealthy poisoning of the steering data can compromise large language models. We propose the first covert poisoning attack targeting steering vectors, in which substituting 4–6% of the tokens in the steering dataset yields a malicious vector that preserves the intended steering behavior on benign prompts while suppressing the model's refusals. Under this threat model, an attacker can distribute an apparently safe bundle of texts, vectors, and weights together with an equivalence certificate that the end user can verify. As a countermeasure, we evaluate a defense based on refusal-direction orthogonalization. Experiments on two open-weight model families and eight model-attribute combinations show that poisoned vectors reach an absolute attack success rate (ASR) of 20%–55%, which is 19 to 51 percentage points above a clean reference, while the orthogonalization defense recovers approximately 82% of the ASR gap without harming benign behavior.
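One way to read the recovery figure quoted above is as the fraction of the poisoned-versus-clean ASR gap that the defense closes. The sketch below assumes that definition; the function name and the input values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def asr_gap_recovered(asr_clean: float, asr_poisoned: float, asr_defended: float) -> float:
    """Fraction of the ASR gap (poisoned minus clean) removed by a defense.

    Assumed definition: (poisoned - defended) / (poisoned - clean).
    """
    gap = asr_poisoned - asr_clean
    if gap <= 0:
        raise ValueError("poisoned ASR must exceed clean ASR")
    return (asr_poisoned - asr_defended) / gap


# Hypothetical numbers for illustration only (not results from the paper).
recovered = asr_gap_recovered(asr_clean=0.04, asr_poisoned=0.54, asr_defended=0.13)
print(f"{recovered:.2f}")  # prints 0.82
```

Under this reading, a value of 1.0 would mean the defended vector is as safe as the clean one, and 0.0 would mean the defense had no effect on ASR.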

📝 Abstract

Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Because the technique is plug-and-play, users share datasets and precomputed vectors to steer model activations. However, we show that a \emph{stealth data poisoning attack} can silently compromise this pipeline. By substituting $4{-}6\%$ of tokens in the steering dataset, an attacker can align the resulting vector with an anti-refusal direction. This jailbreaks the target model while preserving the intended steering effect on benign prompts. Under this threat model, a malicious actor can distribute an apparently safe bundle containing texts, vectors, and weights, alongside an equivalence certificate that the end user can verify. We test the attack on two open-weight model families and eight model-attribute combinations, observing that poisoned vectors reach an absolute attack success rate (ASR) of $20{-}55\%$, a gain of $19{-}51$ percentage points over a clean reference. Finally, we find that a refusal-direction orthogonalization defense can recover ${\approx}82\%$ of the ASR gap without harming benign behavior.
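The orthogonalization idea can be illustrated with a small sketch: remove from a steering vector its component along a known refusal direction, so that adding the vector no longer pushes activations toward or away from refusal. This is a minimal pure-Python illustration under stated assumptions, not the paper's implementation. The helper names, the toy three-dimensional vectors, and the choice to project the vector itself (rather than the activations) are all assumptions.

```python
def dot(a, b):
    """Dot product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(steering_vec, refusal_dir):
    """Return steering_vec with its projection onto refusal_dir removed."""
    norm_sq = dot(refusal_dir, refusal_dir)
    if norm_sq == 0.0:
        raise ValueError("refusal_dir must be non-zero")
    coeff = dot(steering_vec, refusal_dir) / norm_sq
    return [s - coeff * r for s, r in zip(steering_vec, refusal_dir)]


# Toy example (hypothetical values): the refusal direction is the first axis.
refusal_dir = [1.0, 0.0, 0.0]
benign_vec = [0.0, 2.0, 1.0]

# A poisoned vector keeps the benign component but adds an anti-refusal
# component, i.e. a negative multiple of the refusal direction.
poisoned_vec = [b - 1.5 * r for b, r in zip(benign_vec, refusal_dir)]

defended_vec = orthogonalize(poisoned_vec, refusal_dir)
print(defended_vec)  # prints [0.0, 2.0, 1.0]
```

In this toy case the projection removes the anti-refusal component exactly and leaves the benign component untouched. In a real model the benign steering direction may overlap with the refusal direction, in which case some of the intended effect would be removed as well.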

Problem

Research questions and friction points this paper is trying to address.

steering vectors

adversarial attack

data poisoning

LLM safety

activation steering

Innovation

Methods, ideas, or system contributions that make the work stand out.

stealth data poisoning

activation steering

adversarial attack