Safe Vision-Language Models via Unsafe Weights Manipulation

πŸ“… 2025-03-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Vision-language models (VLMs) inherit dataset biases, and existing training-based safety alignment methods often degrade performance on safe inputs. Method: The paper proposes a training-free Unsafe Weights Manipulation (UWM) paradigm. Guided by the fine-grained evaluation of SafeGround, UWM identifies critical parameters by contrasting intra-layer activations between safe and unsafe samples, then applies targeted sign flipping to those weights. Contribution/Results: UWM is the first method to uncover and mitigate the counterintuitive safety drop that alignment induces on safe inputs. Without altering the model architecture or updating parameters through training, it simultaneously improves safety and preserves knowledge capabilities. Experiments show that UWM significantly improves safety on unsafe queries while outperforming state-of-the-art training-based methods on safe queries, achieving near-lossless retention of knowledge capabilities.

πŸ“ Abstract
Vision-language models (VLMs) often inherit the biases and unsafe associations present within their large-scale training dataset. While recent approaches mitigate unsafe behaviors, their evaluation focuses on how safe the model is on unsafe inputs, ignoring potential shortcomings on safe ones. In this paper, we first revise safety evaluation by introducing SafeGround, a new set of metrics that evaluate safety at different levels of granularity. With this metric, we uncover a surprising issue of training-based methods: they make the model less safe on safe inputs. From this finding, we take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter. Their values are then manipulated via negation. Experiments show that UWM achieves the best tradeoff between safety and knowledge preservation, consistently improving VLMs on unsafe queries while outperforming even training-based state-of-the-art methods on safe ones.
Problem

Research questions and friction points this paper is trying to address.

VLMs inherit biases and unsafe associations from their large-scale training data
Existing safety evaluations measure behavior on unsafe inputs while ignoring shortcomings on safe ones
Can a model be made safer without training, avoiding the degradation that training-based alignment causes on safe inputs?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SafeGround for granular safety evaluation.
Proposes Unsafe Weights Manipulation (UWM) technique.
UWM negates key parameters for safer VLMs.
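The selection-and-negation idea behind UWM can be illustrated with a minimal sketch. The scoring rule (mean absolute activation gap between unsafe and safe calibration samples) and the row-level selection granularity are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def uwm_negate(weight, acts_safe, acts_unsafe, top_k=2):
    """Hypothetical UWM-style edit: negate the weight rows whose output
    units activate much more strongly on unsafe calibration samples.

    weight:      (out_dim, in_dim) layer weight matrix
    acts_safe:   (n_safe, out_dim) activations on safe samples
    acts_unsafe: (n_unsafe, out_dim) activations on unsafe samples
    """
    # Importance score: how much more each unit fires on unsafe content
    score = np.abs(acts_unsafe).mean(axis=0) - np.abs(acts_safe).mean(axis=0)
    # Pick the units most specialized to unsafe content
    idx = np.argsort(score)[-top_k:]
    edited = weight.copy()
    edited[idx] *= -1.0  # sign flip: no retraining, no architecture change
    return edited, idx

# Toy calibration data: two units (2 and 5) fire strongly on unsafe inputs
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
acts_safe = rng.normal(scale=0.5, size=(16, 8))
acts_unsafe = rng.normal(scale=0.5, size=(16, 8))
acts_unsafe[:, [2, 5]] += 3.0
W_edited, flipped = uwm_negate(W, acts_safe, acts_unsafe, top_k=2)
print(sorted(flipped.tolist()))  # → [2, 5]
```

Because only the selected rows change sign and all other parameters are untouched, the edit is training-free and leaves the rest of the model's knowledge intact, which matches the tradeoff the paper reports.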
πŸ”Ž Similar Papers
No similar papers found.