🤖 AI Summary
Current AI-generated image detection methods suffer from limited generalizability and robustness, relying heavily on low-level artifacts and model-specific features. To address this, we propose RAVID, the first framework to introduce retrieval-augmented generation (RAG) into visual authenticity detection. RAVID employs a category-prompt-optimized CLIP encoder to dynamically retrieve semantically relevant images and integrates multimodal knowledge via vision-language models (e.g., Qwen-VL) for input enrichment and joint reasoning, establishing a knowledge-driven discriminative paradigm for image forgery detection. Evaluated on UniversalFakeDetect, a comprehensive benchmark spanning 19 generative models, RAVID achieves a mean accuracy of 93.85%. Under challenging degradations including Gaussian blur and JPEG compression, it maintains 80.27% accuracy, significantly outperforming state-of-the-art methods.
📝 Abstract
In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or OpenFlamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, with consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.
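The retrieval step described above (embed the query image, rank a reference database by similarity, and hand the top matches to a VLM alongside the query) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `retrieve_top_k` helper, the 512-dimensional random stand-in embeddings, and the prompt format are all assumptions for demonstration; a real pipeline would use RAVID CLIP embeddings of actual images.

```python
import numpy as np

def retrieve_top_k(query_emb, db_embs, k=3):
    """Return indices and cosine similarities of the k nearest database embeddings."""
    # Normalize to unit length so dot products equal cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Toy stand-ins for image embeddings (512-d, as in CLIP ViT-B/32).
rng = np.random.default_rng(0)
db_embs = rng.normal(size=(100, 512))
query_emb = db_embs[42] + 0.01 * rng.normal(size=512)  # query close to item 42

idx, scores = retrieve_top_k(query_emb, db_embs, k=3)
print(idx[0])  # the near-duplicate item should rank first

# The retrieved images would then be interleaved with the query in a
# multimodal prompt for the VLM (e.g., Qwen-VL) to produce a real/fake verdict.
prompt = (
    f"Reference images: {list(idx)}. "
    "Given these references, is the query image real or AI-generated?"
)
```

In practice the database would store precomputed RAVID CLIP embeddings, and an approximate nearest-neighbor index (e.g., FAISS) would replace the brute-force dot product for large collections.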