🤖 AI Summary
This paper investigates the fundamental cause of the divergent cross-model transferability of adversarial attacks in vision-language models (VLMs): why data-space attacks transfer robustly, whereas representation-space attacks (such as image jailbreaks) exhibit poor transferability.
Method: We propose a theoretical framework, "attack scope determines transferability," proving that the transfer of representation-space attacks relies critically on geometric alignment of feature spaces, while data-space attacks operate in the input space shared by all models and thus sidestep this requirement. We systematically construct both attack types across image classifiers, language models (LMs), and VLMs, and introduce geometric alignment analysis in the projected feature space.
Contribution/Results: Our framework provides the first unified explanation of adversarial transfer across multimodal, vision, and language models. Experiments confirm that data-space attacks achieve stable transfer across VLMs, while unaligned representation-space attacks fail but regain transferability after geometric alignment. This work establishes a theoretical foundation and offers practical guidance for building more robust multimodal models.
📝 Abstract
The field of adversarial robustness has long established that adversarial examples can transfer between image classifiers and that text jailbreaks can transfer between language models (LMs). However, a pair of recent studies reported being unable to transfer image jailbreaks between vision-language models (VLMs). To explain this striking difference, we propose a fundamental distinction regarding the transferability of attacks against machine learning models: attacks in the shared input data space can transfer, whereas attacks in a model's representation space do not, at least not without geometric alignment of representations. We then provide theoretical and empirical evidence for this hypothesis in four settings. First, we mathematically prove the distinction in a simple setting where two networks compute the same input-output map but via different representations. Second, we construct representation-space attacks against image classifiers that are as successful as well-known data-space attacks, but fail to transfer. Third, we construct representation-space attacks against LMs that successfully jailbreak the attacked models but again fail to transfer. Fourth, we construct data-space attacks against VLMs that successfully transfer to new VLMs, and we show that representation-space attacks *can* transfer when the VLMs' latent geometries are sufficiently aligned in post-projector space. Our work reveals that adversarial transfer is not an inherent property of all attacks but is contingent on their operational domain (the shared data space versus each model's unique representation space), a critical insight for building more robust models.
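The first setting above (two networks computing the same input-output map via different representations) can be illustrated with a toy numerical sketch. All shapes, matrices, and values below are illustrative assumptions, not drawn from the paper: model B's features are model A's features rotated by a random orthogonal matrix `R`, so the end-to-end maps are identical while the representation spaces differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4
U = rng.normal(size=(k, d))   # model A: encoder (input -> features)
V = rng.normal(size=(2, k))   # model A: head (features -> output)

# Random orthogonal rotation of the feature space (via QR decomposition).
R, _ = np.linalg.qr(rng.normal(size=(k, k)))
U2, V2 = R @ U, V @ R.T       # model B: same input-output map, rotated features

x = rng.normal(size=d)
assert np.allclose(V @ (U @ x), V2 @ (U2 @ x))  # identical end-to-end maps

# Data-space attack: a perturbation of the input transfers trivially,
# since both models compute the same function of x.
dx = 0.1 * rng.normal(size=d)
assert np.allclose(V @ (U @ (x + dx)), V2 @ (U2 @ (x + dx)))

# Representation-space attack: a feature perturbation crafted for model A...
dh = rng.normal(size=k)
atk_a = V @ (U @ x + dh)                 # effect on model A
atk_b_naive = V2 @ (U2 @ x + dh)         # same dh injected into B: wrong effect
atk_b_aligned = V2 @ (U2 @ x + R @ dh)   # dh aligned via R: recovers A's effect

print(np.allclose(atk_a, atk_b_naive))    # False: fails to transfer
print(np.allclose(atk_a, atk_b_aligned))  # True: transfers after alignment
```

The naive injection fails because `V2 @ (U2 @ x + dh)` expands to `V @ (U @ x) + V @ R.T @ dh`, which differs from model A's `V @ (U @ x + dh)` unless `dh` is first mapped through the alignment `R`, mirroring the abstract's claim that representation-space transfer hinges on geometric alignment.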