ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) are vulnerable to adversarial image attacks, and existing defenses struggle to simultaneously ensure safety, practicality, and efficiency. Method: a two-phase inference-time alignment framework: (1) a multimodal safety evaluator assesses input visual contents and output responses to establish safety awareness; (2) an interference prefix conditions the VLM's generative distribution (shallow alignment) while a sentence-level best-of-N search selects the most harmless and helpful generation path (deep alignment). The method is plug-and-play, requiring no fine-tuning or additional training. Contribution/Results: the first "evaluate then align" two-stage mechanism for inference-time VLM safety. It reduces the unsafe response rate by 87.5% under cross-modality attacks and achieves a 96.6% win-tie rate against baselines in GPT-4 helpfulness evaluation, striking a practical balance among safety, helpfulness, and inference speed.

📝 Abstract
Vision Language Models (VLMs) have become essential backbones for multimodal intelligence, yet significant safety challenges limit their real-world application. While textual inputs are often effectively safeguarded, adversarial visual inputs can easily bypass VLM defense mechanisms. Existing defense methods are either resource-intensive, requiring substantial data and compute, or fail to simultaneously ensure safety and usefulness in responses. To address these limitations, we propose a novel two-phase inference-time alignment framework, Evaluating Then Aligning (ETA): 1) Evaluating input visual contents and output responses to establish a robust safety awareness in multimodal settings, and 2) Aligning unsafe behaviors at both shallow and deep levels by conditioning the VLMs' generative distribution with an interference prefix and performing sentence-level best-of-N to search the most harmless and helpful generation paths. Extensive experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency, reducing the unsafe rate by 87.5% in cross-modality attacks and achieving 96.6% win-ties in GPT-4 helpfulness evaluation. The code is publicly available at https://github.com/DripNowhy/ETA.
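The two-phase "evaluate then align" flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `evaluate_safety`, `generate_sentence`, and `score_response` are hypothetical stubs standing in for the paper's multimodal safety evaluator, the VLM decoder, and the reward used to rank candidates, and the interference-prefix wording is assumed.

```python
# Hypothetical sketch of ETA's two-phase inference-time pipeline.
# All components below are toy stand-ins for the paper's actual modules.

INTERFERENCE_PREFIX = "As an AI assistant, "  # shallow-level alignment prefix (assumed wording)

def evaluate_safety(image, text):
    """Stub safety evaluator: flags unsafe multimodal inputs (toy rule)."""
    return "attack" in text

def generate_sentence(prompt, seed):
    """Stub for one sentence-level VLM continuation (deterministic per seed)."""
    return f"{prompt} candidate-{seed}."

def score_response(response):
    """Stub harmlessness + helpfulness reward (toy scoring)."""
    return -abs(len(response) - 40)

def eta_generate(image, query, n=4):
    # Phase 1 (Evaluating): if the input is judged safe, decode normally.
    if not evaluate_safety(image, query):
        return generate_sentence(query, 0)
    # Phase 2 (Aligning), shallow level: condition on an interference prefix.
    prompt = INTERFERENCE_PREFIX + query
    # Phase 2, deep level: sentence-level best-of-N search over candidates.
    candidates = [generate_sentence(prompt, s) for s in range(n)]
    return max(candidates, key=score_response)
```

In the real framework the per-sentence candidates are scored and extended iteratively, so the best-of-N search steers the whole generation path rather than a single completion.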
Problem

Research questions and friction points this paper is trying to address.

Addressing safety in Vision Language Models
Mitigating adversarial visual input risks
Ensuring safe and useful model responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase inference-time alignment
Interference prefix conditioning
Sentence-level best-of-N search