🤖 AI Summary
Vision-language models (VLMs) are vulnerable to adversarial image attacks, and existing defenses struggle to simultaneously ensure safety, utility, and efficiency.
Method: We propose a two-stage inference-time alignment framework: (1) a multimodal safety evaluator assesses both visual inputs and output responses; (2) interference-prefix conditioning and sentence-level best-of-N search jointly steer the generation path and intervene in the output distribution. The method is zero-shot and plug-and-play, requiring no fine-tuning or additional training.
Contribution/Results: This is the first work to introduce an "evaluate-then-align" dual-stage mechanism for real-time VLM safety. Under cross-modal adversarial attacks, the unsafe response rate drops by 87.5%, and GPT-4-based helpfulness evaluation yields a 96.6% win-tie rate against baselines. The framework achieves strong safety guarantees without compromising utility or computational efficiency, striking a practical balance among safety, helpfulness, and inference speed.
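The two-stage pipeline described above can be illustrated with a minimal sketch. All functions here (`safety_evaluator`, `generate_sentence`, `reward`) are hypothetical stand-ins, not the paper's actual implementation; the real ETA system uses a multimodal safety evaluator and a VLM generator.

```python
# Hypothetical sketch of ETA's "Evaluating Then Aligning" loop.
# The evaluator, generator, and reward function are stub placeholders.

def safety_evaluator(image, text):
    """Stub: flag the multimodal input as unsafe (real ETA evaluates
    visual inputs and output responses)."""
    return image.get("unsafe", False)

def generate_sentence(prefix, candidate_id):
    """Stub generator: return one candidate continuation sentence."""
    return f"{prefix} candidate-{candidate_id}."

def reward(sentence):
    """Stub reward balancing harmlessness and helpfulness."""
    return 1.0 if "sorry" in sentence.lower() else len(sentence) * 0.01

def evaluate_then_align(image, prompt, n=4,
                        interference_prefix="As an AI assistant, "):
    # Stage 1: evaluate the multimodal input for safety.
    if not safety_evaluator(image, prompt):
        # Safe input: generate normally, no intervention needed.
        return generate_sentence(prompt, 0)
    # Stage 2a (shallow alignment): condition generation on an
    # interference prefix to shift the output distribution.
    prefix = interference_prefix + prompt
    # Stage 2b (deep alignment): sentence-level best-of-N search,
    # keeping the candidate with the highest reward.
    candidates = [generate_sentence(prefix, i) for i in range(n)]
    return max(candidates, key=reward)

safe_out = evaluate_then_align({"unsafe": False}, "Describe this image:")
unsafe_out = evaluate_then_align({"unsafe": True}, "Describe this image:")
```

In this sketch the interference prefix is applied only when the evaluator flags the input, which is what keeps the method cheap on benign inputs: the best-of-N search cost is paid only on the flagged fraction of traffic.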
📝 Abstract
Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N search to find the most harmless and helpful generation paths. Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving a 96.6% win-tie rate in GPT-4 helpfulness evaluation. The code is publicly available at https://github.com/DripNowhy/ETA.