Leveraging Many-To-Many Relationships for Defending Against Visual-Language Adversarial Attacks

📅 2024-05-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language (VL) models exhibit insufficient robustness against multimodal adversarial attacks. Moreover, prevailing defense strategies are restricted to unimodal (image-only) perturbations and assume one-to-one (1:1) image–text pairings, thereby neglecting the inherent many-to-many (N:N) semantic relationships between images and texts. This work presents the first N:N relation-aware multimodal robust learning framework tailored for image–text retrieval. It integrates controllable 1:N/N:1 cross-modal augmentation, generative cross-modal perturbation augmentation, and alignment-aware loss optimization—effectively mitigating 1:1 overfitting while enforcing semantic consistency across modalities. Extensive experiments demonstrate that our approach significantly improves retrieval robustness under diverse white-box and black-box multimodal attacks, without compromising original task performance. These results validate the critical role of N:N structural modeling in secure VL learning.
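The alignment-aware loss the summary mentions can be illustrated with a many-to-many contrastive objective, where each image may have several matching texts instead of exactly one. The sketch below is a minimal, hedged interpretation (the function name, the binary `pos_mask`, and the use of plain NumPy are my assumptions, not the paper's implementation):

```python
import numpy as np

def nxn_contrastive_loss(img_emb, txt_emb, pos_mask, temperature=0.07):
    """Many-to-many (N:N) contrastive loss sketch: each image may have
    several positive texts, indicated by a binary pos_mask of shape
    (num_images, num_texts)."""
    # L2-normalize both modalities, then compute scaled cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    # Log-softmax over all texts for each image.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average the log-likelihood over all positive texts per image,
    # so multiple valid captions are rewarded rather than competing.
    pos_per_img = pos_mask.sum(axis=1)
    loss = -(log_prob * pos_mask).sum(axis=1) / pos_per_img
    return loss.mean()
```

In a 1:1 setup `pos_mask` would be the identity matrix; widening it to N:N is one way to encode the "a single image can be described in numerous ways" structure the paper exploits.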

📝 Abstract
Recent studies have revealed that vision-language (VL) models are vulnerable to adversarial attacks for image-text retrieval (ITR). However, existing defense strategies for VL models primarily focus on zero-shot image classification and do not consider the simultaneous manipulation of image and text, nor the inherent many-to-many (N:N) nature of ITR, where a single image can be described in numerous ways, and vice versa. To this end, this paper studies defense strategies against adversarial attacks on VL models for ITR for the first time. In particular, we focus on how to leverage the N:N relationship in ITR to enhance adversarial robustness. We find that, although adversarial training easily overfits to specific one-to-one (1:1) image-text pairs in the training data, diverse augmentation techniques that create one-to-many (1:N) / many-to-one (N:1) image-text pairs can significantly improve adversarial robustness in VL models. Additionally, we show that the alignment of the augmented image-text pairs is crucial for the effectiveness of the defense strategy, and that inappropriate augmentations can even degrade the model's performance. Based on these findings, we propose a novel defense strategy that leverages the N:N relationship in ITR, effectively generating diverse yet highly-aligned N:N pairs using basic augmentations and generative model-based augmentations. This work provides a novel perspective on defending against adversarial attacks in VL tasks and opens up new research directions for future work.
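The 1:N / N:1 pair construction described in the abstract can be sketched as a simple batch-expansion step. Everything below is a hedged illustration of the idea, not the paper's pipeline: the function name, the caption-variant list, and the callable image augmentations are all hypothetical placeholders (the paper additionally uses generative model-based augmentations, which are omitted here):

```python
import random

def build_1_to_n_pairs(image, captions, image_augs, n_text=2, n_img=2, seed=0):
    """Expand a single (image, caption) pair into 1:N and N:1 pairs
    using basic augmentations — a sketch of how diverse yet aligned
    pairs might reduce overfitting to a fixed 1:1 pairing."""
    rng = random.Random(seed)
    # 1:N — one image paired with several caption variants,
    # all assumed to describe the same image faithfully.
    one_to_n = [(image, rng.choice(captions)) for _ in range(n_text)]
    # N:1 — several augmented views of the image paired with one caption.
    caption = captions[0]
    n_to_one = [(aug(image), caption) for aug in rng.sample(image_augs, n_img)]
    return one_to_n + n_to_one
```

The abstract's finding that "inappropriate augmentations can even degrade the model's performance" corresponds here to the requirement that every caption variant and image view stay semantically aligned with its partner; augmentations that break that alignment would inject false positives into training.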
Problem

Research questions and friction points this paper is trying to address.

Defending vision-language models against multimodal adversarial attacks
Addressing one-to-many relationships in image-text pairs for robustness
Enhancing adversarial training with diverse and well-aligned augmentations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal adversarial training for VL models
Leverages one-to-many image-text relationships
Enhances robustness with diverse augmentation techniques