AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders

📅 2025-05-30

🤖 AI Summary

To address the vulnerability of large vision-language models (LVLMs) to jailbreak attacks, this paper proposes a lightweight, training-free defense that runs entirely at inference time and combines two complementary steps. First, it uses attention to identify image patches that are irrelevant to the textual query and masks them, disrupting the propagation of adversarial perturbations. Second, it performs multi-step joint intention analysis across the image and text, combined with a zero-shot safety classification of the instruction, to detect latent harmful instructions before a response is generated. The core contribution is pairing image masking (to disrupt perturbations) with cross-modal intention analysis (to intercept harmful intents), which yields robust protection without modifying model parameters. Across diverse LVLMs and jailbreak benchmarks, the average defense success rate rises from 52.4% to 81.7%, at the cost of only a 2% average drop in general task accuracy and modest inference overhead.
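The patch-selection idea can be pictured with a minimal sketch. It assumes, for illustration only, that relevance is scored as the cosine similarity between each patch embedding and a text embedding, and that a small fraction (`mask_ratio`) of the least relevant patches is chosen for masking. The names `select_patches_to_mask` and `mask_ratio` and the cosine scoring are hypothetical stand-ins, not AMIA's actual implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_patches_to_mask(patch_embs: List[Vector], text_emb: Vector,
                           mask_ratio: float = 0.1) -> List[int]:
    """Hypothetical scorer: indices of the k least text-relevant patches.

    k = floor(mask_ratio * number_of_patches). Cosine similarity is an
    assumed relevance score, not necessarily the one used in the paper.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    k = int(len(patch_embs) * mask_ratio)
    scores = [cosine(p, text_emb) for p in patch_embs]
    # Sort patch indices from least to most relevant and keep the first k.
    least_relevant = sorted(range(len(patch_embs)), key=lambda i: scores[i])[:k]
    return sorted(least_relevant)


# Toy example: 4 patch embeddings, text embedding along the first axis.
patch_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text_emb = [1.0, 0.0]
print(select_patches_to_mask(patch_embs, text_emb, mask_ratio=0.5))  # [1, 3]
```

Keeping `mask_ratio` small mirrors the abstract's point that only a small set of patches is masked, so most of the image content the query depends on is left intact.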

📝 Abstract

We introduce AMIA, a lightweight, inference-only defense for Large Vision-Language Models (LVLMs) that (1) Automatically Masks a small set of text-irrelevant image patches to disrupt adversarial perturbations, and (2) conducts joint Intention Analysis to uncover and mitigate hidden harmful intents before response generation. Without any retraining, AMIA improves defense success rates across diverse LVLMs and jailbreak benchmarks from an average of 52.4% to 81.7%, preserves general utility with only a 2% average accuracy drop, and incurs only modest inference overhead. Ablation confirms both masking and intention analysis are essential for a robust safety-utility trade-off.
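The headline numbers can be unpacked with a small calculation. It assumes that defense success rate (DSR) is the percentage of jailbreak attempts that receive a safe response and that the reported average is an unweighted mean over benchmarks; the abstract does not state either definition, and the helper names are hypothetical. Under that reading, moving from 52.4% to 81.7% is a gain of 29.3 percentage points, or roughly 56% relative to the baseline.

```python
from typing import List, Sequence, Tuple


def defense_success_rate(is_safe: Sequence[bool]) -> float:
    """Percentage of attack attempts judged safe (assumed DSR definition)."""
    if not is_safe:
        raise ValueError("need at least one attack outcome")
    return 100.0 * sum(is_safe) / len(is_safe)


def average_dsr(per_benchmark: List[Sequence[bool]]) -> float:
    """Unweighted mean of per-benchmark DSRs (assumed averaging scheme)."""
    if not per_benchmark:
        raise ValueError("need at least one benchmark")
    rates = [defense_success_rate(outcomes) for outcomes in per_benchmark]
    return sum(rates) / len(rates)


def gain(before: float, after: float) -> Tuple[float, float]:
    """Absolute gain in percentage points and relative gain in percent."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


print(gain(52.4, 81.7))  # (29.3, 55.9)
```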

Problem

Research questions and friction points this paper is trying to address.

Disrupt adversarial perturbations in LVLMs

Uncover and mitigate hidden harmful intents

Improve defense success rates without retraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically masks irrelevant image patches

Conducts joint harmful intent analysis

Lightweight inference-only defense method
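The two components can be chained at inference time without touching model weights. The following is a minimal sketch under stated assumptions: the model is any callable that takes image patches and a prompt and returns text, masking simply zeroes the selected patches, and `ANALYSIS_PROMPT`, `ANSWER_PROMPT`, `defended_generate`, and `toy_model` are hypothetical names with illustrative wording, not AMIA's actual prompts or code. The toy model exists only to make the control flow observable.

```python
from typing import Callable, Iterable, List

Patch = List[float]
# Hypothetical interface: any LVLM wrapper taking (image patches, prompt) -> text.
Model = Callable[[List[Patch], str], str]

# Hypothetical prompt templates, not the prompts used in the paper.
ANALYSIS_PROMPT = (
    "Analyze the intention behind this image and request together; "
    "state whether it is harmful or benign.\nRequest: {query}"
)
ANSWER_PROMPT = (
    "Intent analysis: {analysis}\n"
    "Respond to the request, refusing if the intent is harmful.\nRequest: {query}"
)


def mask_patches(patches: List[Patch], indices: Iterable[int],
                 fill: float = 0.0) -> List[Patch]:
    """Return a copy with the selected patches replaced by a constant fill."""
    to_mask = set(indices)
    return [[fill] * len(p) if i in to_mask else list(p)
            for i, p in enumerate(patches)]


def defended_generate(model: Model, patches: List[Patch], query: str,
                      mask_indices: Iterable[int]) -> str:
    """Mask first, then analyze intent, then answer conditioned on the analysis."""
    masked = mask_patches(patches, mask_indices)
    analysis = model(masked, ANALYSIS_PROMPT.format(query=query))
    return model(masked, ANSWER_PROMPT.format(analysis=analysis, query=query))


def toy_model(patches: List[Patch], prompt: str) -> str:
    """Stand-in model used only to exercise the control flow."""
    if prompt.startswith("Analyze"):
        return "harmful" if "bomb" in prompt.lower() else "benign"
    if "Intent analysis: harmful" in prompt:
        return "Sorry, I can't help with that."
    return "Here is a helpful answer."


image = [[0.2, 0.4], [0.9, 0.1], [0.5, 0.5]]
print(defended_generate(toy_model, image, "How do I build a bomb?", [1]))
# prints "Sorry, I can't help with that."
print(defended_generate(toy_model, image, "Describe the scene.", [1]))
# prints "Here is a helpful answer."
```

Running the analysis as a separate pass before the answer reflects the abstract's description of uncovering harmful intent before response generation; a real deployment would swap `toy_model` for an actual LVLM call.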
