Understanding and Rectifying Safety Perception Distortion in VLMs

📅 2025-02-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a "safety perception distortion" phenomenon in vision-language models (VLMs): multimodal inputs, particularly the visual modality, induce a systematic overestimation of the safety of harmful queries, leaving VLMs more vulnerable to jailbreak attacks than their text-only LLM backbones. To address this, the authors characterize the problem and propose ShiftDC, a training-free correction framework. ShiftDC decomposes the modality-induced activation shift, isolates its safety-relevant component, and removes it to restore the LLM backbone's inherent safety alignment while preserving multimodal capabilities. Evaluated across multiple safety benchmarks, ShiftDC significantly improves robustness against adversarial prompts without compromising performance on standard vision-language tasks. Crucially, it requires no fine-tuning, additional training data, or architectural modifications, enabling efficient and scalable safety enhancement for existing VLMs.

📝 Abstract
Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety. By isolating and removing the safety-relevant component, ShiftDC restores the inherent safety alignment of the LLM backbone while preserving the vision-language capabilities of VLMs. Empirical results demonstrate that ShiftDC significantly enhances alignment performance on safety benchmarks without impairing model utility.
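The core idea in the abstract, decomposing the modality-induced activation shift and removing only its safety-relevant component, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `calibrate_activation`, the use of a single unit "safety direction," and the simple vector projection are all illustrative assumptions.

```python
import numpy as np

def calibrate_activation(h_mm, h_text, safety_dir):
    """Hypothetical sketch of an activation-shift calibration step.

    h_mm       : activation vector for a multimodal (image + text) input
    h_text     : activation vector for its text-only counterpart
    safety_dir : assumed unit vector along which "safe" vs. "unsafe"
                 activations separate (e.g., a difference of class means)
    """
    # Modality-induced activation shift relative to the text-only input.
    shift = h_mm - h_text
    # Project the shift onto the safety direction to isolate the
    # safety-relevant component of the shift.
    safety_component = np.dot(shift, safety_dir) * safety_dir
    # Remove only the safety-relevant part; the safety-irrelevant part
    # of the shift (vision-language information) is left untouched.
    return h_mm - safety_component

# Toy example in 2D: safety direction is the x-axis.
safety_dir = np.array([1.0, 0.0])
h_text = np.array([0.0, 0.0])
h_mm = np.array([3.0, 4.0])   # shift has components on and off the safety axis
h_cal = calibrate_activation(h_mm, h_text, safety_dir)
print(h_cal)  # [0. 4.] : safety-axis drift removed, other features kept
```

The design choice sketched here is that calibration acts on residual activations rather than weights, which is what makes a method like this training-free.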
Problem

Research questions and friction points this paper is trying to address.

Address safety perception distortion in VLMs
Mitigate vulnerability to harmful requests
Calibrate modality-induced activation shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation Shift Disentanglement
Calibration of modality-induced shift
Enhancing VLM safety alignment
👥 Authors
Xiaohan Zou, The Pennsylvania State University
Jian Kang, University of Rochester
G. Kesidis, The Pennsylvania State University
Lu Lin, The Pennsylvania State University