🤖 AI Summary
Current AI-generated image detection methods suffer from limited generalizability and robustness, relying heavily on low-level artifacts and model-specific features. To address this, we propose RAVID, the first framework to introduce retrieval-augmented generation (RAG) into visual authenticity detection. RAVID employs a category-prompt-optimized CLIP encoder to dynamically retrieve semantically relevant images and integrates multimodal knowledge via vision-language models (e.g., Qwen-VL) for input enrichment and joint reasoning, establishing a knowledge-driven discriminative paradigm for image forgery detection. Evaluated on UniversalFakeDetect, a comprehensive benchmark spanning 19 generative models, RAVID achieves a mean accuracy of 93.85%. Under challenging degradations including Gaussian blur and JPEG compression, it maintains 80.27% accuracy, significantly outperforming state-of-the-art methods.
📝 Abstract
In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or OpenFlamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, with consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.
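The retrieval step described above (embed the query image, rank a reference database by similarity, and hand the top matches to a VLM alongside the query) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `retrieve_top_k` helper, the 512-dimensional random stand-in embeddings, and the prompt format are all assumptions for demonstration; a real pipeline would use RAVID CLIP embeddings of actual images.

```python
import numpy as np

def retrieve_top_k(query_emb, db_embs, k=3):
    """Return indices and cosine similarities of the k nearest database embeddings."""
    # Normalize to unit length so dot products equal cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Toy stand-ins for image embeddings (512-d, as in CLIP ViT-B/32).
rng = np.random.default_rng(0)
db_embs = rng.normal(size=(100, 512))
query_emb = db_embs[42] + 0.01 * rng.normal(size=512)  # query close to item 42

idx, scores = retrieve_top_k(query_emb, db_embs, k=3)
print(idx[0])  # the near-duplicate item should rank first

# The retrieved images would then be interleaved with the query in a
# multimodal prompt for the VLM (e.g., Qwen-VL) to produce a real/fake verdict.
prompt = (
    f"Reference images: {list(idx)}. "
    "Given these references, is the query image real or AI-generated?"
)
```

In practice the database would store precomputed RAVID CLIP embeddings, and an approximate nearest-neighbor index (e.g., FAISS) would replace the brute-force dot product for large collections.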