GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of safety alignment mechanisms in multimodal large language models (MLLMs) under adversarial inputs, noting that existing jailbreaking methods exhibit limited efficacy against reasoning-capable models. To overcome this limitation, the authors propose a gamified jailbreaking framework that innovatively integrates cognitive-stage decision guidance with gamification principles. By decomposing and recombining visual semantics, constructing suggestive scenarios, and embedding instruction traps, the framework steers the model to actively reconstruct malicious intent during task execution, thereby circumventing safety safeguards. Notably, this approach establishes the first structured multimodal reasoning chain tailored for reasoning-intensive MLLMs, effectively attenuating their safety-focused attention. Experimental results demonstrate attack success rates of 92.13%, 91.20%, and 85.87% on Gemini 2.5 Flash, QvQ-MAX, and GPT-4o, respectively, significantly outperforming current baselines.

📝 Abstract
Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model's own reasoning incentives. As a result, they underperform on reasoning models (models with Chain-of-Thought reasoning) compared to non-reasoning ones (models without it). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
jailbreak
adversarial attacks
reasoning models
safety alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gamified Jailbreak
Multimodal Large Language Models
Chain-of-Thought Reasoning
Adversarial Alignment Evasion
Instructional Traps