🤖 AI Summary
This work addresses the vulnerability of vision-language models (VLMs) to malicious prompt attacks, a critical security concern exacerbated by the limited efficiency and robustness of existing defenses. To tackle this challenge, the authors propose the Multimodal Aggregated Feature Extraction (MAFE) framework, which extends CLIP to handle long textual inputs and integrate multimodal information. Leveraging MAFE, they uncover, for the first time, distinct distributional differences between benign and malicious prompts in the feature space. Building on this insight, they design VLMShield, a lightweight, plug-and-play security detector that significantly improves detection efficiency and robustness without compromising the original model's performance. VLMShield supports flexible deployment across diverse scenarios and consistently outperforms current state-of-the-art methods.
📝 Abstract
Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks, as safety alignment is weakened during visual integration. Existing defenses fall short in both efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework, which enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building on this finding, we develop VLMShield, a lightweight, plug-and-play safety detector that efficiently identifies multimodal malicious attacks. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. We hope this work paves the way for more secure multimodal AI deployment. Code is available at [github.com/pgqihere/VLMShield](https://github.com/pgqihere/VLMShield).
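The core idea, aggregating CLIP embeddings of long text, fusing them with image features, and detecting attacks from the distributional gap, can be illustrated with a toy sketch. This is not the paper's actual MAFE or VLMShield implementation: the mean-pooling aggregation, concatenation fusion, nearest-centroid detector, and synthetic Gaussian features below are all simplifying assumptions chosen only to make the pipeline concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_text_chunks(chunk_embs: np.ndarray) -> np.ndarray:
    """Mean-pool embeddings of text windows (a stand-in for how a long prompt,
    split into CLIP-sized chunks, could be aggregated into one vector)."""
    return chunk_embs.mean(axis=0)

def fuse(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Concatenate text and image embeddings into a unified feature
    (hypothetical fusion rule, not the paper's)."""
    return np.concatenate([text_emb, image_emb])

def make_sample(shift: float, d: int = 8) -> np.ndarray:
    # Synthetic stand-ins for CLIP embeddings: 3 text windows + 1 image.
    # Malicious prompts are modeled as a shifted distribution.
    chunks = rng.normal(shift, 1.0, size=(3, d))
    image = rng.normal(shift, 1.0, size=d)
    return fuse(aggregate_text_chunks(chunks), image)

benign = np.stack([make_sample(0.0) for _ in range(200)])
malicious = np.stack([make_sample(1.5) for _ in range(200)])

# A deliberately lightweight detector: nearest centroid in feature space,
# exploiting the distributional gap between the two classes.
mu_b, mu_m = benign.mean(axis=0), malicious.mean(axis=0)

def flags_malicious(x: np.ndarray) -> bool:
    return np.linalg.norm(x - mu_m) < np.linalg.norm(x - mu_b)

acc = np.mean([flags_malicious(x) for x in malicious] +
              [not flags_malicious(x) for x in benign])
print(f"toy detector accuracy: {acc:.2f}")
```

Because the detector is just a distance comparison on precomputed features, it adds negligible inference cost and can sit in front of any VLM as a plug-in filter, which mirrors the deployment style the abstract describes.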