Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs

📅 2025-01-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the weak cross-model transferability of adversarial video examples against video-based multimodal large language models (V-MLLMs) in black-box settings. The authors propose I2V-MLLM, an attack that uses an image-based multimodal model (e.g., BLIP-2) as a surrogate to generate transferable adversarial videos. The approach combines three components: (1) perturbation of video representations in the latent space, (2) integration of multimodal interactions across vision and language, and (3) a perturbation propagation technique that copes with unknown frame-sampling strategies. Together these address key limitations of prior methods: poor generalization of video-feature perturbations, reliance on sparse key frames, and failure to integrate multimodal information. Evaluated on the MSVD-QA and MSRVTT-QA video question answering benchmarks, I2V-MLLM achieves average black-box attack success rates of 55.48% and 58.26%, respectively, competitive with white-box attacks on the target models.

📝 Abstract
Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models, a common and practical real-world scenario, remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal model (IMM) as a surrogate model to craft adversarial video samples. Multimodal interactions and temporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. In addition, a perturbation propagation technique is introduced to handle different unknown frame-sampling strategies. Experimental results demonstrate that our method generates adversarial examples with strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as the surrogate model) achieve competitive performance, with average attack success rates of 55.48% on MSVD-QA and 58.26% on MSRVTT-QA for VideoQA tasks, respectively. Our code will be released upon acceptance.
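The core recipe described in the abstract (use a surrogate encoder, perturb latent features under an L-infinity budget, then propagate the perturbation across sampled frames) can be illustrated with a toy sign-gradient PGD loop. This is a minimal sketch, not the paper's implementation: the linear "encoder" `W`, the closed-form gradient, and all hyperparameters here are illustrative assumptions standing in for BLIP-2 features and backpropagated gradients.

```python
import random

def matvec(W, x):
    """Apply a toy linear 'encoder' W (stand-in for a surrogate feature extractor)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def pgd_feature_attack(W, x, eps=0.1, alpha=0.02, steps=20):
    """Craft a perturbation d maximizing the latent distance ||W(x+d) - Wx||^2
    under an L-infinity bound eps, via sign-gradient ascent (PGD-style)."""
    d = [random.uniform(-eps, eps) for _ in x]  # random start inside the ball
    for _ in range(steps):
        # For a linear encoder, grad of ||W d||^2 w.r.t. d is 2 * W^T (W d).
        Wd = matvec(W, d)
        grad = [2 * sum(W[i][j] * Wd[i] for i in range(len(W)))
                for j in range(len(x))]
        # Take a sign step, then project back into the L-infinity ball.
        d = [max(-eps, min(eps, dj + alpha * (1 if g >= 0 else -1)))
             for dj, g in zip(d, grad)]
    return d

def propagate(frames, d):
    """Toy 'perturbation propagation': reuse one perturbation on every frame,
    so the attack survives whatever frames the victim model samples."""
    return [[fi + di for fi, di in zip(frame, d)] for frame in frames]
```

Usage: craft `d` on one frame with the surrogate, then `propagate(video_frames, d)` yields an adversarial clip whose frames all carry the same bounded perturbation. In the actual method, the perturbation is optimized against multimodal (vision and language) features rather than a single linear map.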
Problem

Research questions and friction points this paper is trying to address.

V-MLLMs Robustness
Cross-modal Adversarial Examples
Black-box Scenario
Innovation

Methods, ideas, or system contributions that make the work stand out.

I2V-MLLM Attack
Cross-modal Video-Language Models
Black-box Scenario
Linhao Huang
Beijing University of Technology
Xue Jiang
Southern University of Science and Technology
Zhiqiang Wang
Hong Kong University of Science and Technology
Wentao Mo
Tsinghua University
Trustworthy Artificial Intelligence · Multimodal Learning
Xi Xiao
Oak Ridge National Laboratory | University of Alabama at Birmingham
LLM / MLLM Efficiency · Image / Video Generation · Image / Video Understanding
Bo Han
TMLR Group, Hong Kong Baptist University
Yongjie Yin
China Electronics Corporation
Feng Zheng
Southern University of Science and Technology