📝 Abstract
The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution.
In this work, we introduce a new method for understanding, monitoring, and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision.
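The core recipe above (SVD of the weight difference, then cosine-similarity monitoring of activations against the top singular directions) can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the layer size, the synthetic low-rank fine-tuning update, and the noise levels are all assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension

# Hypothetical base weights for one layer.
W_base = rng.standard_normal((d, d)) / np.sqrt(d)

# Simulate fine-tuning as a rank-1 update that writes a new
# "behavior" along input direction v (an assumption for this demo).
v = rng.standard_normal(d); v /= np.linalg.norm(v)
u = rng.standard_normal(d); u /= np.linalg.norm(u)
W_ft = W_base + 3.0 * np.outer(u, v)

# Top singular vectors of the weight difference capture the update.
U, S, Vt = np.linalg.svd(W_ft - W_base)
top_dir = Vt[0]  # top right-singular vector (input-side direction)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# An activation exercising the fine-tuned behavior aligns strongly
# with top_dir; a generic activation does not.
aligned = v + 0.02 * rng.standard_normal(d)
generic = rng.standard_normal(d)
print(abs(cos_sim(aligned, top_dir)))  # high (near 1)
print(abs(cos_sim(generic, top_dir)))  # low (near 0)
```

Monitoring then reduces to checking, per forward pass, whether any activation's cosine similarity to the recovered directions crosses a calibrated threshold.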
For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover "unlearned" information. Beyond monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific areas of fine-tuning focus, including marketing strategies and Midjourney prompt generation.
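The intervention side (blocking a backdoored behavior, or steering to re-elicit an "unlearned" one) amounts to simple linear edits on activations along a recovered direction. The sketch below is a hedged illustration under toy assumptions; the direction, dimension, and steering strength `alpha` are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # toy hidden dimension

# Stand-in for a behavior direction recovered from the weight-diff SVD.
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

def ablate(h, v):
    """Project out the component of activation h along unit direction v
    (blocks the associated behavior, e.g. a backdoor)."""
    return h - (h @ v) * v

def steer(h, v, alpha=4.0):
    """Add alpha * v to activation h
    (can re-elicit a suppressed behavior, e.g. unlearned knowledge)."""
    return h + alpha * v

# A toy activation that partly exercises the behavior.
h = rng.standard_normal(d) + 2.0 * direction
print(float(ablate(h, direction) @ direction))  # component along v is ~0
print(float(steer(h, direction) @ direction))   # component is amplified
```

Ablation is a single orthogonal projection per monitored layer, so it adds negligible inference cost on top of the cosine-similarity check.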
Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.