MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks

📅 2025-03-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit critical security vulnerabilities in cross-modal reasoning, rendering them susceptible to multimodal jailbreak attacks that circumvent existing safety mechanisms and elicit harmful outputs. Method: We propose the first narrative-driven, role-immersive multimodal jailbreak framework. It constructs interleaved image-text, multi-turn visual story sequences from environment-role-action triples to progressively weaken model defenses. By integrating role immersion with structured semantic reconstruction, the method shows that narrative cues can spontaneously activate inherent model biases, overcoming the limitations of text-only jailbreaks. Technically, it unifies Stable Diffusion-based image generation, multi-turn visual narrative modeling, cross-modal contextual guidance, and decomposition of toxic queries into semantic triples. Contribution/Results: The framework achieves state-of-the-art attack success rates across six mainstream MLLMs and multiple benchmarks, outperforming the best prior baseline by up to 17.5% and exposing systemic gaps in current cross-modal safety mechanisms.

๐Ÿ“ Abstract
While safety mechanisms have significantly progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.
Problem

Research questions and friction points this paper is trying to address.

Exploits multimodal reasoning to bypass MLLM safety mechanisms
Uses narrative-driven context to lower model defenses
Activates model biases to violate ethical safeguards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal jailbreak using narrative-driven context
Stable Diffusion for visual storytelling sequences
Role immersion to bypass safety mechanisms