Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

📅 2025-12-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies a vulnerability of vision-language models (VLMs) under cross-modal adversarial attacks: over-reliance on function words. To address it, the authors propose Function-word De-Attention (FDA), which computes both the original cross-attention and the function-word cross-attention within attention heads and subtracts the latter from the former, reducing the model's sensitivity to function words. Evaluated on image-text retrieval and visual grounding, FDA lowers the attack success rate (ASR) by an average of 18%, 13%, and 53% on the three tested models for retrieval, at a cost of 0.2%, 0.3%, and 0.6% in performance; on visual grounding it lowers ASR by 90% while improving performance by 0.3%. The core contributions are (i) identifying function words as a source of cross-modal adversarial vulnerability, and (ii) a lightweight attention-level method that improves VLM robustness with little loss in accuracy.

📝 Abstract

To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate their impact. Similar to differential amplifiers, FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields an average 18/13/53% drop in attack success rate (ASR) with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
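The subtraction described in the abstract can be illustrated with a minimal single-head sketch. This is not the authors' implementation: the abstract does not specify how the function-word cross-attention is formed, so the sketch assumes image tokens act as queries over text keys, that the function-word branch is a softmax restricted to function-word tokens, and that a hypothetical coefficient `lam` scales the subtraction. The word list `FUNCTION_WORDS` and the names `function_word_mask` and `de_attention` are illustrative only.

```python
import numpy as np

# Illustrative list only; the source does not say which words count as function words.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "is", "and", "to", "with"}


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def function_word_mask(tokens):
    """Boolean mask marking which text tokens are function words."""
    return np.array([t.lower() in FUNCTION_WORDS for t in tokens], dtype=bool)


def de_attention(q_img, k_txt, v_txt, fw_mask, lam=0.5):
    """Single-head cross-attention with a function-word branch subtracted.

    q_img: (n_img, d) image-side queries (assumed direction of attention).
    k_txt, v_txt: (n_txt, d) text-side keys and values.
    fw_mask: (n_txt,) bool, True where the token is a function word.
    lam: hypothetical subtraction weight, not taken from the paper.
    Returns the attended output and the differential attention weights.
    """
    d = q_img.shape[-1]
    scores = q_img @ k_txt.T / np.sqrt(d)
    attn = softmax(scores)  # original cross-attention

    if not fw_mask.any():
        # No function words in the caption: nothing to subtract.
        return attn @ v_txt, attn

    # Function-word branch: softmax over function-word keys only.
    fw_scores = np.where(fw_mask, scores, -np.inf)
    fw_attn = softmax(fw_scores)

    # Differential step: subtract the function-word branch from the original.
    diff = attn - lam * fw_attn
    return diff @ v_txt, diff


# Usage with random features standing in for real encoder outputs.
rng = np.random.default_rng(0)
tokens = ["a", "dog", "on", "the", "grass"]
q = rng.normal(size=(4, 8))
k = rng.normal(size=(len(tokens), 8))
v = rng.normal(size=(len(tokens), 8))
out, weights = de_attention(q, k, v, function_word_mask(tokens))
print(out.shape, weights.shape)
```

Under these assumptions, the weights on content words are left unchanged while the weights on function words are pushed down, which is the qualitative effect the abstract attributes to FDA. Like a differential amplifier, the result is a difference of two signals, so the weights are not renormalized and may be negative.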

Problem

Research questions and friction points this paper is trying to address.

Enhances vision-language model robustness against adversarial attacks

Reduces vulnerability from function words in cross-modal interactions

Improves alignment and robustness with minimal performance trade-offs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Function-word De-Attention reduces vulnerability to adversarial attacks

Differential subtraction of function-word cross-attention enhances model robustness

Method achieves significant attack success rate drop with minimal performance loss
