Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

📅 2025-11-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the reliability of multimodal large language models (MLLMs) in intelligent agents by proposing a novel adversarial confusion attack: a single adversarial image induces high-confidence yet semantically incoherent outputs, systematically degrading cross-modal reasoning. Methodologically, the attack maximizes entropy over the next-token distribution as its primary optimization objective, with both full-image and CAPTCHA-compatible perturbation strategies realized within the PGD framework; robustness is further strengthened by optimizing against a small ensemble of open-source MLLMs. The resulting perturbations show strong cross-model transferability and generalization, successfully compromising unseen models including Qwen3-VL and GPT-5.1, while achieving effective interference in both full-image and CAPTCHA scenarios. The attack goes beyond conventional jailbreak and targeted-misclassification attacks, establishing a new paradigm for MLLM security evaluation.
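To make the objective concrete, below is a minimal PGD sketch that maximizes next-token entropy against a single white-box model. The `model` callable returning `[batch, seq, vocab]` logits, the helper names, and the hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def next_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Shannon entropy of the model's next-token distribution (last position).
    log_p = F.log_softmax(logits[:, -1, :], dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

def pgd_confusion(model, image, eps=8 / 255, alpha=1 / 255, steps=100):
    # PGD ascent on next-token entropy within an L_inf ball of radius eps.
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        logits = model(image + delta)  # assumed interface: [batch, seq, vocab] logits
        next_token_entropy(logits).backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                # ascend the entropy
            delta.clamp_(-eps, eps)                           # project into the eps-ball
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep adversarial pixels in [0, 1]
            delta.grad.zero_()
    return (image + delta).detach()
```

Because the loss rewards a flat next-token distribution rather than any specific wrong answer, the same loop covers both the full-image setting and a CAPTCHA-style variant where the perturbation is masked to a patch.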

📝 Abstract
We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
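In symbols, the objective the abstract describes can be written roughly as below; the notation (clean image $x$, perturbation $\delta$, budget $\epsilon$, model ensemble $\mathcal{M}$, vocabulary $\mathcal{V}$) is ours for illustration, not taken from the paper.

```latex
\max_{\|\delta\|_{\infty} \le \epsilon}\;
\frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}}
H\!\bigl(p_{m}(\cdot \mid x + \delta)\bigr),
\qquad
H(p) = -\sum_{v \in \mathcal{V}} p(v) \log p(v)
```

Here $p_{m}(\cdot \mid x + \delta)$ is model $m$'s next-token distribution given the perturbed image; maximizing its entropy pushes every ensemble member toward maximally uncertain, and hence incoherent, generations.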
Problem

Research questions and friction points this paper is trying to address.

Introducing the Adversarial Confusion Attack to disrupt multimodal large language models (MLLMs)
Inducing systematic disruption that yields incoherent or confidently incorrect model outputs
Generating transferable adversarial perturbations that affect a wide range of MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial images maximize next-token entropy
A single perturbation disrupts multiple MLLMs (see the ensemble sketch after this list)
Basic PGD technique transfers to unseen models
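Extending the single-model sketch above to the ensemble setting is essentially a one-line change to the loss; `models` and the reused `next_token_entropy` helper are the same illustrative assumptions as before.

```python
import torch

def ensemble_entropy(models, adv_image):
    # Average next-token entropy across the ensemble so that a single
    # perturbation is optimized against all models jointly; reuses the
    # next_token_entropy helper from the sketch above.
    return torch.stack([next_token_entropy(m(adv_image)) for m in models]).mean()
```

Substituting this for the single-model loss in the PGD loop trains one perturbation against every ensemble member at once, which is what plausibly drives the reported transfer to unseen open-source and proprietary models.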