Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor cross-model transferability of adversarial attacks on Vision-Language Pre-training (VLP) models, which the authors attribute in part to overreliance on the surrogate model: the optimization follows surrogate responses more than input semantics. To mitigate this, they propose DeBias-Attack, which they describe as the first transfer-based VLP attack to correct surrogate-specific bias through gradient correction. It maintains two perturbation branches: a main branch on the original image, and a reference branch on a weak-semantic image built from the dataset mean image plus small Gaussian noise resampled at each iteration. The reference gradient serves as an estimate of surrogate-specific bias, and the component of the main gradient aligned with it is projected out before each update of the adversarial image. Context-aware text substitution is then performed using the updated adversarial image. The reported experiments show strong transfer performance across VLP models, downstream tasks, and both open-source and closed-source multimodal large language models.
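As a rough illustration of the weak-semantic reference image mentioned in the summary, the sketch below adds freshly sampled Gaussian noise to a dataset mean image. It assumes images are float arrays in [0, 1]; the function name `weak_semantic_reference`, the default `sigma`, and the clipping step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def weak_semantic_reference(mean_image, sigma=0.05, rng=None):
    """Return the dataset mean image plus small Gaussian noise.

    Hypothetical helper: `sigma` and the clipping to [0, 1] are
    assumptions made for this sketch, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Noise is drawn anew on every call, mirroring "resampled at each iteration".
    noise = rng.normal(0.0, sigma, size=mean_image.shape)
    # Keep the result in the assumed valid pixel range.
    return np.clip(mean_image + noise, 0.0, 1.0)
```

Calling the function once per attack iteration yields a different low-content image each time, so any gradient computed on it should depend mostly on the surrogate model rather than on visual content.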

📝 Abstract

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.
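The gradient correction step described in the abstract can be sketched as a vector projection. This is a minimal sketch under stated assumptions, not the authors' implementation: gradients are treated as NumPy arrays, "aligned projection" is interpreted here as the projection being removed only when the two gradients have a positive inner product, and the name `remove_aligned_projection` and the `eps` guard are hypothetical.

```python
import numpy as np


def remove_aligned_projection(g_main, g_ref, eps=1e-12):
    """Subtract from g_main its component along g_ref.

    Assumption for this sketch: the projection is removed only when the
    gradients point in a similar direction (positive inner product).
    """
    g = g_main.ravel()
    r = g_ref.ravel()
    ref_norm_sq = float(r @ r)
    if ref_norm_sq < eps:
        # Reference gradient is (near) zero: nothing to project out.
        return g_main.copy()
    coeff = float(g @ r) / ref_norm_sq
    if coeff <= 0.0:
        # Gradients are orthogonal or opposed: leave the main gradient as is.
        return g_main.copy()
    # Standard projection removal: g - (<g, r> / <r, r>) * r
    return (g - coeff * r).reshape(g_main.shape)
```

After the correction, the returned gradient is orthogonal to the reference gradient whenever the two were aligned, so the update direction no longer contains the component that the reference branch attributes to the surrogate model.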

Problem

Research questions and friction points this paper is trying to address.

adversarial transferability

vision-language pre-training

surrogate-specific bias

black-box attack

cross-model generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial transferability

vision-language pre-training

surrogate-specific bias