🤖 AI Summary
Current safety mechanisms struggle to defend against cross-modal jailbreak attacks in which inputs appear benign but outputs are harmful, particularly when malicious intent is not explicit in the input. This work proposes the Distributed Semantic Recomposition (DSR) framework, which implicitly encodes harmful intent into innocuous textual and visual primitives through semantic decomposition, cross-modal alignment, and reasoning guidance, thereby steering multimodal models to generate harmful content during inference. DSR achieves high attack success rates with near-zero input toxicity, exposing critical blind spots in existing defenses across multiple commercial multimodal large language models. The approach reveals a fundamental tension between model utility and safety, the utility-safety paradox, in which advanced reasoning capabilities can be exploited to circumvent conventional safeguards.
📝 Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in content synthesis and autonomous reasoning. Previous safety guardrails are primarily designed to intercept unimodal textual input, leaving them vulnerable to cross-modal jailbreak attacks. However, both unimodal textual attacks and existing cross-modal jailbreaks typically include some explicit harmful or sensitive content at the input level, a property we call Harm-Bearing. This allows the model's safety filters to detect and block such content easily. To address this limitation, we propose Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreak framework that decomposes harmful intent into a set of benign textual and visual primitives. By exploiting the model's reasoning ability, DSR enables the latent fusion of these seemingly innocent components into harmful outputs during the cross-modal inference phase. Extensive experiments on multiple commercial MLLM pipelines demonstrate that DSR achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate. Our findings uncover a critical Utility-Safety Paradox in MLLMs, where the model's instruction-following proficiency facilitates its own cognitive exploitation. Content Warning: This paper contains harmful model responses.