Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

📅 2026-02-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing black-box jailbreak attacks rely on fixed image-text pairs, which struggle to bypass the increasingly robust multimodal safety alignment mechanisms in vision-language models. To address this limitation, this work proposes CrossTALK, a novel framework that enables scalable cross-modal entanglement attacks. CrossTALK decomposes malicious intent into multi-hop instructions and dynamically fuses visual and textual semantics through cross-modal cue entanglement, knowledge-extensible reconstruction, and scene-nesting strategies, thereby generating highly deceptive adversarial examples. Extensive experiments demonstrate that CrossTALK significantly outperforms current black-box attack methods across multiple mainstream vision-language models, achieving state-of-the-art jailbreak success rates.

📝 Abstract
Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple, fixed image-text combinations whose attack complexity does not scale, limiting their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose CrossTALK (Cross-modal enTAngLement attacK), a scalable approach that extends and entangles information clues across modalities to exceed the safety alignment patterns VLMs have been trained on and generalize to. Specifically, knowledge-scalable reframing extends harmful tasks into multi-hop chain instructions, cross-modal clue entangling migrates visualizable entities into images to build multimodal reasoning links, and cross-modal scenario nesting uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show that CrossTALK achieves state-of-the-art attack success rates.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
jailbreak attacks
multimodal reasoning
red-teaming
cross-modal entanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Entanglement
Vision-Language Models
Jailbreaking
Red-teaming
Multimodal Reasoning
Yu Yan
Institute of Computing Technology, Chinese Academy of Sciences
Sheng Sun
Institute of Computing Technology, Chinese Academy of Sciences
federated learning, edge intelligence
Shengjia Cheng
People’s Public Security University of China
Teli Liu
People’s Public Security University of China
Mingfeng Li
People’s Public Security University of China
Min Liu
Institute of Computing Technology, Chinese Academy of Sciences
Computing, Networking