AI Summary
This work reveals a previously overlooked security vulnerability in activation vector steering mechanisms, which can be exploited through stealthy data poisoning to compromise large language models. We propose the first covert poisoning attack targeting steering vectors, wherein minimal token substitutions in the steering data yield malicious vectors that simultaneously preserve benign steering behavior and induce model refusal failures. To counter this threat, we introduce verifiable secure bundles and a defense strategy based on directional orthogonalization. Empirical evaluations across multiple open-source models demonstrate that the attack achieves success rates of 20%–55%, an increase of 19 to 51 percentage points over clean vectors, while our orthogonalization-based defense closes approximately 82% of that gap without degrading normal functionality.
Abstract
Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors to steer model activations. However, we show that a \emph{stealth data poisoning attack} can compromise this pipeline. By substituting $4{-}6\%$ of tokens in the steering dataset, an attacker can silently align the resulting vector with an anti-refusal direction. This jailbreaks the target model while preserving the intended steering effect on benign prompts. Under this threat model, a malicious actor can distribute an apparently safe bundle containing texts, vectors, and weights, alongside an equivalence certificate that the end-user can verify. We test the attack on two open-weight model families and eight model-attribute combinations, observing that poisoned vectors reach an absolute attack success rate (ASR) of $20{-}55\%$, an increase of $+19\%$ to $+51\%$ (absolute) over a clean reference. Finally, we find that a refusal-direction orthogonalization defense can recover ${\approx}82\%$ of the ASR gap without harming benign behavior.
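To make the refusal-direction orthogonalization idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the defense amounts to projecting the component along a known refusal direction out of the steering vector, and that "recovering the ASR gap" means the fraction of the poisoned-minus-clean ASR difference removed by the defense. The names `orthogonalize` and `gap_recovered`, the toy 16-dimensional vectors, and the ASR values in the usage lines are all hypothetical and chosen only for illustration.

```python
import numpy as np

def orthogonalize(steering_vec, refusal_dir):
    """Remove the component of steering_vec that lies along refusal_dir.

    Assumption: the defense is a plain vector projection,
    v' = v - (v . r_hat) r_hat, with r_hat the unit refusal direction.
    """
    norm = np.linalg.norm(refusal_dir)
    if norm == 0:
        raise ValueError("refusal_dir must be non-zero")
    r_hat = refusal_dir / norm
    return steering_vec - np.dot(steering_vec, r_hat) * r_hat

def gap_recovered(asr_clean, asr_poisoned, asr_defended):
    """Fraction of the ASR gap (poisoned - clean) closed by the defense.

    Assumption: this is one plausible definition of the recovery metric.
    """
    return (asr_poisoned - asr_defended) / (asr_poisoned - asr_clean)

# Toy data: a random "benign" vector plus an injected anti-refusal component.
rng = np.random.default_rng(0)
refusal_dir = rng.normal(size=16)
benign = rng.normal(size=16)
poisoned = benign - 0.5 * refusal_dir  # hypothetical poisoning effect

cleaned = orthogonalize(poisoned, refusal_dir)
residual = float(np.dot(cleaned, refusal_dir))  # numerically close to zero

# Illustrative numbers only, not results from the paper.
recovery = gap_recovered(asr_clean=0.0, asr_poisoned=0.5, asr_defended=0.09)
```

In this sketch the projection is idempotent, so applying it to an already cleaned vector leaves it unchanged. Whether a clean steering vector keeps its intended effect after projection depends on how much of it legitimately overlaps with the refusal direction, which this toy example does not model.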