Why does weak-OOD help? A Further Step Towards Understanding Jailbreaking VLMs

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates why weak out-of-distribution (weak-OOD) inputs effectively evade safety alignment mechanisms in large vision-language models (VLMs). We identify a critical misalignment: discrepancies between pretraining objectives (e.g., image-text matching) and alignment objectives (e.g., safety fine-tuning) cause VLMs to exhibit inconsistent sensitivity, accurately perceiving visual semantics yet failing to reliably infer malicious user intent. To address this, we propose the first theoretical framework explaining weak-OOD jailbreaking and introduce SI-OCR, a novel attack paradigm that integrates OCR-augmented semantic perturbations with explicit modeling of VLM pretraining characteristics to bypass safety rejections while preserving image naturalness. Extensive experiments demonstrate that SI-OCR achieves state-of-the-art performance across mainstream VLMs, including Qwen-VL, LLaVA, and InternVL, improving attack success rate by 23.6% and cross-model transferability by 31.4% over prior methods.

📝 Abstract
Large Vision-Language Models (VLMs) are susceptible to jailbreak attacks: researchers have developed a variety of attack strategies that can successfully bypass the safety mechanisms of VLMs. Among these approaches, jailbreak methods based on the Out-of-Distribution (OOD) strategy have garnered widespread attention due to their simplicity and effectiveness. This paper further advances the in-depth understanding of OOD-based VLM jailbreak methods. Experimental results demonstrate that jailbreak samples generated via mild OOD strategies exhibit superior performance in circumventing the safety constraints of VLMs, a phenomenon we define as "weak-OOD". To unravel the underlying causes of this phenomenon, this study takes SI-Attack, a typical OOD-based jailbreak method, as the research object. We attribute this phenomenon to a trade-off between two dominant factors: input intent perception and model refusal triggering. The inconsistency in how these two factors respond to OOD manipulations gives rise to this phenomenon. Furthermore, we provide a theoretical argument for the inevitability of such inconsistency from the perspective of discrepancies between model pre-training and alignment processes. Building on the above insights, we draw inspiration from optical character recognition (OCR) capability enhancement, a core task in the pre-training phase of mainstream VLMs. Leveraging this capability, we design a simple yet highly effective VLM jailbreak method, whose performance outperforms that of SOTA baselines.
Problem

Research questions and friction points this paper is trying to address.

Investigating why weak-OOD strategies effectively bypass VLM safety mechanisms
Analyzing trade-offs between intent perception and refusal triggering in jailbreaks
Developing enhanced jailbreak methods using OCR-inspired pre-training capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weak-OOD strategy enhances jailbreak sample generation
Leverages OCR capability from VLM pre-training phase
Analyzes trade-off between intent perception and refusal triggering