Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

📅 2025-11-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the reliability of multimodal large language models (MLLMs) in intelligent agents by proposing a novel adversarial confusion attack: a single adversarial image induces high-confidence yet semantically incoherent outputs, systematically degrading cross-modal reasoning. Methodologically, the attack maximizes entropy over the next-token distribution as its primary optimization objective, with both full-image and CAPTCHA-compatible perturbation strategies realized within the PGD framework; robustness is further strengthened by optimizing against a small ensemble of open-source MLLMs. The resulting perturbations show strong cross-model transferability and generalization, successfully compromising unseen models including Qwen3-VL and GPT-5.1, while achieving effective interference in both full-image and CAPTCHA scenarios. The attack goes beyond conventional jailbreak and targeted-misclassification attacks, establishing a new paradigm for MLLM security evaluation.
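To make the objective concrete, below is a minimal PGD sketch that maximizes next-token entropy against a single white-box model. The `model` callable returning `[batch, seq, vocab]` logits, the helper names, and the hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def next_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Shannon entropy of the model's next-token distribution (last position).
    log_p = F.log_softmax(logits[:, -1, :], dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

def pgd_confusion(model, image, eps=8 / 255, alpha=1 / 255, steps=100):
    # PGD ascent on next-token entropy within an L_inf ball of radius eps.
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        logits = model(image + delta)  # assumed interface: [batch, seq, vocab] logits
        next_token_entropy(logits).backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                # ascend the entropy
            delta.clamp_(-eps, eps)                           # project into the eps-ball
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep adversarial pixels in [0, 1]
            delta.grad.zero_()
    return (image + delta).detach()
```

Because the loss rewards a flat next-token distribution rather than any specific wrong answer, the same loop covers both the full-image setting and a CAPTCHA-style variant where the perturbation is masked to a patch.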

📝 Abstract
We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
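In symbols, the objective the abstract describes can be written roughly as below; the notation (clean image $x$, perturbation $\delta$, budget $\epsilon$, model ensemble $\mathcal{M}$, vocabulary $\mathcal{V}$) is ours for illustration, not taken from the paper.

```latex
\max_{\|\delta\|_{\infty} \le \epsilon}\;
\frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}}
H\!\bigl(p_{m}(\cdot \mid x + \delta)\bigr),
\qquad
H(p) = -\sum_{v \in \mathcal{V}} p(v) \log p(v)
```

Here $p_{m}(\cdot \mid x + \delta)$ is model $m$'s next-token distribution given the perturbed image; maximizing its entropy pushes every ensemble member toward maximally uncertain, and hence incoherent, generations.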
Problem

Research questions and friction points this paper is trying to address.

Introducing the Adversarial Confusion Attack to disrupt multimodal large language models (MLLMs)
Inducing systematic disruption that yields incoherent or confidently incorrect model outputs
Generating transferable adversarial perturbations that affect a wide range of MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial images maximize next-token entropy
A single perturbation disrupts multiple MLLMs (see the ensemble sketch after this list)
Basic PGD technique transfers to unseen models
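Extending the single-model sketch above to the ensemble setting is essentially a one-line change to the loss; `models` and the reused `next_token_entropy` helper are the same illustrative assumptions as before.

```python
import torch

def ensemble_entropy(models, adv_image):
    # Average next-token entropy across the ensemble so that a single
    # perturbation is optimized against all models jointly; reuses the
    # next_token_entropy helper from the sketch above.
    return torch.stack([next_token_entropy(m(adv_image)) for m in models]).mean()
```

Substituting this for the single-model loss in the PGD loop trains one perturbation against every ensemble member at once, which is what plausibly drives the reported transfer to unseen open-source and proprietary models.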