Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

📅 2026-02-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing black-box jailbreak attacks rely on fixed image-text pairs, which struggle to bypass the increasingly robust multimodal safety alignment mechanisms in vision-language models. To address this limitation, this work proposes CrossTALK, a novel framework that enables scalable cross-modal entanglement attacks. CrossTALK decomposes malicious intent into multi-hop instructions and dynamically fuses visual and textual semantics through cross-modal cue entanglement, knowledge-extensible reconstruction, and scene-nesting strategies, thereby generating highly deceptive adversarial examples. Extensive experiments demonstrate that CrossTALK significantly outperforms current black-box attack methods across multiple mainstream vision-language models, achieving state-of-the-art jailbreak success rates.

📝 Abstract
Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple, fixed image-text combinations whose attack complexity does not scale, limiting their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose CrossTALK (Cross-modal enTAngLement attacK), a scalable approach that extends and entangles information clues across modalities to exceed the safety alignment patterns VLMs have been trained on and generalize to. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show that CrossTALK achieves state-of-the-art attack success rates.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
jailbreak attacks
multimodal reasoning
red-teaming
cross-modal entanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Entanglement
Vision-Language Models
Jailbreaking
Red-teaming
Multimodal Reasoning
Yu Yan
Institute of Computing Technology, Chinese Academy of Sciences
Sheng Sun
Institute of Computing Technology, Chinese Academy of Sciences
federated learning, edge intelligence
Shengjia Cheng
People’s Public Security University of China
Teli Liu
People’s Public Security University of China
Mingfeng Li
People’s Public Security University of China
Min Liu
Institute of Computing Technology, Chinese Academy of Sciences
Computing, Networking