📝 Abstract
The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution.
In this work, we introduce a new method for understanding, monitoring, and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision.
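The core recipe above (SVD of the weight difference, then cosine-similarity monitoring of activations against the top singular directions) can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the layer size, the synthetic low-rank fine-tuning update, and the noise levels are all assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension

# Hypothetical base weights for one layer.
W_base = rng.standard_normal((d, d)) / np.sqrt(d)

# Simulate fine-tuning as a rank-1 update that writes a new
# "behavior" along input direction v (an assumption for this demo).
v = rng.standard_normal(d); v /= np.linalg.norm(v)
u = rng.standard_normal(d); u /= np.linalg.norm(u)
W_ft = W_base + 3.0 * np.outer(u, v)

# Top singular vectors of the weight difference capture the update.
U, S, Vt = np.linalg.svd(W_ft - W_base)
top_dir = Vt[0]  # top right-singular vector (input-side direction)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# An activation exercising the fine-tuned behavior aligns strongly
# with top_dir; a generic activation does not.
aligned = v + 0.02 * rng.standard_normal(d)
generic = rng.standard_normal(d)
print(abs(cos_sim(aligned, top_dir)))  # high (near 1)
print(abs(cos_sim(generic, top_dir)))  # low (near 0)
```

Monitoring then reduces to checking, per forward pass, whether any activation's cosine similarity to the recovered directions crosses a calibrated threshold.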
For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Beyond monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific areas of fine-tuning focus, including marketing strategies and Midjourney prompt generation.
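The intervention side (blocking a backdoored behavior, or steering to re-elicit an "unlearned" one) amounts to simple linear edits on activations along a recovered direction. The sketch below is a hedged illustration under toy assumptions; the direction, dimension, and steering strength `alpha` are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # toy hidden dimension

# Stand-in for a behavior direction recovered from the weight-diff SVD.
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

def ablate(h, v):
    """Project out the component of activation h along unit direction v
    (blocks the associated behavior, e.g. a backdoor)."""
    return h - (h @ v) * v

def steer(h, v, alpha=4.0):
    """Add alpha * v to activation h
    (can re-elicit a suppressed behavior, e.g. unlearned knowledge)."""
    return h + alpha * v

# A toy activation that partly exercises the behavior.
h = rng.standard_normal(d) + 2.0 * direction
print(float(ablate(h, direction) @ direction))  # component along v is ~0
print(float(steer(h, direction) @ direction))   # component is amplified
```

Ablation is a single orthogonal projection per monitored layer, so it adds negligible inference cost on top of the cosine-similarity check.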
Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.