AI Summary
To address the challenge of real-time detection of localized adversarial patch attacks in vision systems, this paper proposes Visual Retrieval-Augmented Generation (VRAG), a training-free and fine-tuning-free framework. VRAG integrates multiple heterogeneous vision-language models (VLMs) and performs zero-shot generative reasoning to identify attacks, establishing the first training-free paradigm for adversarial patch detection. We construct a scalable visual attack database that offers plug-and-play compatibility with both open-source and closed-source large VLMs, including Qwen-VL-Plus, UI-TARS-72B-DPO, and Gemini-2.0. Experiments demonstrate that UI-TARS-72B-DPO achieves 95% accuracy, state-of-the-art among open-source models, while Gemini-2.0 attains the highest accuracy of 98%. The approach significantly reduces reliance on manual annotation and exhibits strong generalization, effectively detecting diverse novel adversarial patches across unseen attack types and domains.
Abstract
Adversarial patch attacks pose a major threat to vision systems by embedding localized perturbations that mislead deep models. Traditional defense methods often require retraining or fine-tuning, making them impractical for real-world deployment. We propose a training-free Visual Retrieval-Augmented Generation (VRAG) framework that integrates Vision-Language Models (VLMs) for adversarial patch detection. By retrieving patches and images visually similar to stored attack examples from a continuously expanding database, VRAG performs generative reasoning to identify diverse attack types, all without additional training or fine-tuning. We extensively evaluate open-source large-scale VLMs, including Qwen-VL-Plus, Qwen2.5-VL-72B, and UI-TARS-72B-DPO, alongside Gemini-2.0, a closed-source model. Notably, the open-source UI-TARS-72B-DPO model achieves up to 95 percent classification accuracy, setting a new state-of-the-art for open-source adversarial patch detection. Gemini-2.0 attains the highest overall accuracy, 98 percent, but remains closed-source. Experimental results demonstrate VRAG's effectiveness in identifying a variety of adversarial patches with minimal human annotation, paving the way for robust, practical defenses against evolving adversarial patch attacks.
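The retrieval-then-reason pipeline described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the paper does not specify its encoder, similarity metric, or prompt format, so `embed` is a stand-in for a real visual encoder, the database is a simple in-memory list, and the VLM call itself (to Qwen-VL-Plus, UI-TARS-72B-DPO, Gemini-2.0, etc.) is omitted; only the retrieval step and prompt construction are shown.

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    # Stand-in for a real visual encoder (e.g. a CLIP-style model).
    # Flatten and L2-normalize so dot products act as cosine similarity.
    v = image.astype(np.float64).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

class AttackDatabase:
    """Continuously expandable store of known adversarial-patch examples
    (hypothetical structure; the paper's actual database schema is not shown)."""
    def __init__(self):
        self.vectors, self.labels = [], []

    def add(self, image: np.ndarray, label: str) -> None:
        self.vectors.append(embed(image))
        self.labels.append(label)

    def retrieve(self, query: np.ndarray, k: int = 3):
        # Cosine similarity against every stored example, top-k results.
        q = embed(query)
        sims = np.array([v @ q for v in self.vectors])
        top = np.argsort(-sims)[:k]
        return [(self.labels[i], float(sims[i])) for i in top]

def build_vlm_prompt(retrieved) -> str:
    # Ground the zero-shot query in retrieved attack examples; the actual
    # VLM invocation is left out of this sketch.
    refs = "\n".join(f"- {label} (similarity {sim:.2f})" for label, sim in retrieved)
    return ("The following known adversarial patches resemble the query image:\n"
            f"{refs}\n"
            "Does the query image contain an adversarial patch? Answer yes or no, "
            "and explain which reference, if any, it matches.")

# Demo with random stand-in images.
rng = np.random.default_rng(0)
db = AttackDatabase()
db.add(rng.random((8, 8, 3)), "sticker-patch-v1")
db.add(rng.random((8, 8, 3)), "camouflage-patch")
query = rng.random((8, 8, 3))
print(build_vlm_prompt(db.retrieve(query, k=2)))
```

Because detection is driven by retrieval rather than learned weights, adding a newly discovered patch is just another `db.add(...)` call, which is what makes the framework training-free and the database "continuously expanding".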