🤖 AI Summary
This work identifies a novel data supply-chain poisoning vector for text-to-image (T2I) generation models, in which the adversarial fragility of vision-language models (VLMs) used for image captioning is exploited to inject stealthy "dirty labels." The authors propose a black-box adversarial mislabeling attack: feeding subtly perturbed images to commercial VLMs (e.g., Google Vertex AI, Microsoft Azure) induces semantically plausible yet incorrect captions, yielding highly inconspicuous poisoned training samples. Crucially, corrupting fewer than 0.1% of training samples suffices to significantly degrade downstream T2I model fidelity. The paper provides the first systematic empirical validation of this attack on real-world, production-grade VLMs, achieving an attack success rate above 73% and substantial distortion in generated outputs. These findings expose VLMs, when deployed as automated annotators, as a critical security bottleneck in multimodal model development pipelines, offering both new insights and concrete evidence for securing multimodal AI supply chains.
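To make the mechanism concrete, the sketch below illustrates the core idea in a simplified white-box setting. This is an assumption-laden stand-in, not the paper's actual method: it substitutes the open-source BLIP captioner for the black-box commercial VLMs the paper targets, and the budget, step size, and iteration count are illustrative. PGD-style gradient steps make an attacker-chosen incorrect caption likely under the captioner while keeping the perturbation small.

```python
# A minimal white-box sketch of adversarial mislabeling, for illustration only.
# Assumptions: the open-source BLIP captioner stands in for the black-box
# commercial VLMs attacked in the paper; eps/alpha/steps are illustrative, and
# the L-infinity budget is applied in the processor's normalized pixel space
# (a real attack would also clamp back to a valid image).
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only gradients w.r.t. the image are needed

image = Image.open("dog.jpg").convert("RGB")           # true content: a dog
target = "a photo of a cat sitting on a sofa"          # attacker-chosen mislabel

inputs = processor(images=image, text=target, return_tensors="pt")
pixels, labels = inputs["pixel_values"], inputs["input_ids"]

eps, alpha, steps = 8 / 255, 1 / 255, 100              # perturbation budget
delta = torch.zeros_like(pixels, requires_grad=True)

for _ in range(steps):
    loss = model(pixel_values=pixels + delta, input_ids=labels, labels=labels).loss
    loss.backward()                                     # NLL of the target caption
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()              # descend: make target likely
        delta.clamp_(-eps, eps)                         # stay inside the budget
    delta.grad.zero_()

adv = (pixels + delta).detach()
print(processor.decode(model.generate(pixel_values=adv)[0], skip_special_tokens=True))
```

In the black-box setting the paper studies, the same objective must be pursued without gradients from the target, e.g., by optimizing against surrogate captioners and relying on transferability; the paper's exact procedure is detailed in the full text.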
📝 Abstract
Today's text-to-image generative models are trained on millions of images sourced from the Internet, each paired with a detailed caption produced by a Vision-Language Model (VLM). This captioning stage is critical: it supplies the models with the large volumes of high-quality image-caption pairs they need for training. However, recent work suggests that VLMs are vulnerable to stealthy adversarial attacks, in which adversarial perturbations added to images mislead them into producing incorrect captions.
In this paper, we explore the feasibility of adversarial mislabeling attacks on VLMs as a mechanism for poisoning the training pipelines of text-to-image models. Our experiments demonstrate that VLMs are highly vulnerable to adversarial perturbations, allowing attackers to produce benign-looking images that are consistently miscaptioned by the VLMs. This has the effect of injecting strong "dirty-label" poison samples into the training pipeline of text-to-image models, successfully altering their behavior with a small number of poisoned samples. We find that while potential defenses can be effective, they can be targeted and circumvented by adaptive attackers, suggesting a cat-and-mouse game that is likely to reduce the quality of training data and increase the cost of text-to-image model development. Finally, we demonstrate the real-world effectiveness of these attacks, achieving high attack success rates (over 73%) even in black-box scenarios against commercial VLMs (Google Vertex AI and Microsoft Azure).
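The downstream poisoning step requires no further attacker involvement: the pipeline's own captioning VLM mislabels the perturbed images, producing dirty-label pairs that are mixed into otherwise clean data. A minimal sketch of that step, assuming a hypothetical vlm_caption() wrapper around the annotator API:

```python
import random

def vlm_caption(image_path: str) -> str:
    """Hypothetical wrapper around the pipeline's captioning VLM
    (e.g., a commercial API); placeholder for illustration only."""
    raise NotImplementedError

def build_training_set(clean_pairs, adversarial_images):
    """Mix dirty-label pairs into an otherwise clean caption dataset.

    clean_pairs: list of (image_path, caption) tuples from normal scraping.
    adversarial_images: benign-looking images carrying the perturbation.
    """
    # The attacker is passive here: the pipeline itself captions the
    # perturbed images and thereby creates the poisoned pairs.
    poison = [(img, vlm_caption(img)) for img in adversarial_images]
    # Per the paper, a small fraction of poisoned samples can already
    # alter the behavior of the downstream text-to-image model.
    data = clean_pairs + poison
    random.shuffle(data)
    return data  # feeds the standard T2I training loop unchanged
```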