Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

📅 2026-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the poor transferability of adversarial attacks on vision-language pre-training (VLP) models, which often stems from overreliance on the surrogate model and the surrogate-specific bias it introduces into the optimization direction. To mitigate this, the authors propose DeBias-Attack, which they describe as the first transfer-based VLP attack to correct surrogate-specific bias through gradient correction. It maintains two perturbation branches: a main branch on the original image and a reference branch on a weak-semantic image, constructed from the dataset mean image plus small Gaussian noise resampled at each iteration. Before each update of the adversarial image, the reference gradient is used as an estimate of the surrogate-specific bias, and the aligned projection of the main gradient onto it is removed. Context-aware text substitution is then performed using the updated adversarial image. The authors report strong transfer-based attack performance across VLP models, downstream tasks, and both open-source and closed-source multimodal large language models.
📝 Abstract
Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.
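The gradient correction described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `weak_semantic_image` and `remove_aligned_projection`, the noise scale `sigma`, the [0, 1] pixel range, and the reading of "aligned projection" as "the component along the reference gradient, removed only when the inner product is positive" are all assumptions made for illustration. The gradients here are toy arrays rather than outputs of a real surrogate model.

```python
import numpy as np


def weak_semantic_image(mean_image, sigma, rng):
    """Build a weak-semantic reference image: dataset mean plus small Gaussian noise.

    Assumption: pixels live in [0, 1]; sigma is a hypothetical noise scale.
    Calling this once per iteration resamples the noise each time.
    """
    noise = rng.normal(0.0, sigma, size=mean_image.shape)
    return np.clip(mean_image + noise, 0.0, 1.0)


def remove_aligned_projection(g_main, g_ref, eps=1e-12):
    """Subtract from g_main its component along g_ref.

    Assumption: the projection is removed only when the two gradients are
    positively aligned (inner product > 0); otherwise g_main is returned as-is.
    """
    dot = float(g_main.ravel() @ g_ref.ravel())
    if dot <= 0.0:
        return g_main.copy()
    coeff = dot / (float(g_ref.ravel() @ g_ref.ravel()) + eps)
    return g_main - coeff * g_ref


rng = np.random.default_rng(0)

# Toy stand-ins for the main-branch and reference-branch gradients.
g_main = np.array([[3.0, 1.0], [0.0, 2.0]])
g_ref = np.array([[1.0, 0.0], [0.0, 0.0]])
g_corr = remove_aligned_projection(g_main, g_ref)

# Toy reference image: a flat gray "dataset mean" with resampled noise.
mean_image = np.full((4, 4, 3), 0.5)
ref_image = weak_semantic_image(mean_image, sigma=0.05, rng=rng)
```

In this toy case the corrected gradient keeps the components of `g_main` that are orthogonal to `g_ref` and drops the shared direction. In an actual attack, the corrected gradient would feed the usual sign or step update under the perturbation budget.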
Problem

Research questions and friction points this paper is trying to address.

adversarial transferability
vision-language pre-training
surrogate-specific bias
black-box attack
cross-model generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

adversarial transferability
vision-language pre-training
surrogate-specific bias
gradient correction
black-box attack