Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

📅 2026-05-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limited robustness of current vision-language models under multimodal adversarial attacks. Existing approaches often rely on single-modality perturbations or white-box access, which limits their applicability in real-world scenarios. To overcome these limitations, the authors propose a black-box, universal multimodal adversarial attack framework that jointly optimizes image and text perturbations to achieve strong transferability. The method uses wavelet-based texture constraints to keep image perturbations visually imperceptible, applies an Lp-norm constraint in the embedding space to preserve textual semantics, and introduces a cross-modal gradient alignment regularizer to improve attack generalizability. According to the reported experiments, the approach significantly degrades the performance of state-of-the-art vision-language models using only query-based access and is effective across diverse models and tasks.
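To make the idea of a wavelet-based texture constraint concrete, here is a minimal sketch of one way such a constraint could work. It is not the authors' implementation: the single-level Haar transform, the function names, and the rule "zero the low-frequency band and clip the detail bands to a budget" are all assumptions chosen for illustration.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt2(x):
    """Single-level 2D Haar transform of an (H, W) array with even H and W."""
    a = (x[0::2, :] + x[1::2, :]) / SQRT2   # low-pass along rows
    d = (x[0::2, :] - x[1::2, :]) / SQRT2   # high-pass along rows
    ll = (a[:, 0::2] + a[:, 1::2]) / SQRT2  # coarse approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / SQRT2  # detail bands (fine texture)
    hl = (d[:, 0::2] + d[:, 1::2]) / SQRT2
    hh = (d[:, 0::2] - d[:, 1::2]) / SQRT2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return x

def project_to_texture_band(delta, budget):
    """Hypothetical constraint: keep only fine-scale detail, bounded by `budget`.

    The coarse band is zeroed so the perturbation cannot shift overall
    brightness or large structures; detail coefficients are clipped.
    """
    ll, lh, hl, hh = haar_dwt2(delta)
    clipped = [np.clip(band, -budget, budget) for band in (lh, hl, hh)]
    return haar_idwt2(np.zeros_like(ll), *clipped)
```

In an attack loop, a projection like `project_to_texture_band` would be applied to the universal image perturbation after each update, in the same role that a plain pixel-space clip plays in standard norm-bounded attacks.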
📝 Abstract
Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy (MMAS), a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an Lp-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations' gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack with prevalent LVLMs.
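The abstract says the perturbations are optimized using only model queries and that the text perturbation is norm-bounded in embedding space. The sketch below shows one common way to do both, and it rests on several assumptions not stated in the source: a two-point random-direction gradient estimator, an L2 ball as the norm constraint, and a hypothetical `loss_fn` that stands in for whatever score the queried model returns. The cross-modal alignment term is not sketched.

```python
import numpy as np

def estimate_gradient(loss_fn, z, num_queries=50, sigma=1e-2, rng=None):
    """Zeroth-order gradient estimate from loss queries only (assumed estimator).

    Each random direction u costs two queries; the finite difference along u
    is used as the weight for u.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(z)
    for _ in range(num_queries):
        u = rng.standard_normal(z.shape)
        diff = loss_fn(z + sigma * u) - loss_fn(z - sigma * u)
        grad += (diff / (2.0 * sigma)) * u
    return grad / num_queries

def project_l2_ball(p, eps):
    """Project p onto the L2 ball of radius eps (the norm order is an assumption)."""
    norm = np.linalg.norm(p)
    return p if norm <= eps else p * (eps / norm)

def optimize_text_perturbation(loss_fn, dim, eps, steps=50, lr=0.1, seed=0):
    """Projected descent on an embedding-space perturbation, query access only."""
    rng = np.random.default_rng(seed)
    p = np.zeros(dim)
    for _ in range(steps):
        g = estimate_gradient(loss_fn, p, rng=rng)
        p = project_l2_ball(p - lr * g, eps)  # step, then enforce the budget
    return p
```

As a usage illustration, `loss_fn` could be any scalar function of the perturbation, for example a toy quadratic `lambda p: float(np.sum((p - target) ** 2))`. In a real attack it would wrap a call to the target model and return the distance between its output and the attacker's chosen target.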
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Adversarial Attacks
Multi-Modal Robustness
Black-Box Attack
Cross-Modal Vulnerability
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal adversarial attack
texture-constrained perturbation
cross-modal optimization
black-box attack
universal adversarial perturbation