🤖 AI Summary
This work addresses the challenges posed by the proliferation of AI-generated images and the limitations of existing detection methods, which often rely on large-scale training data and exhibit poor generalization. To overcome these issues, the authors propose a lightweight detection framework based on a frozen multimodal vision encoder. Leveraging the inherent separability between real and synthetic images in the encoder’s embedding space, the method employs a representation-aware data selection strategy to construct a highly efficient training set using only 10K samples. A simple linear classifier trained on this curated set achieves robust detection performance. Evaluated on benchmarks such as RealWorldBench, the approach significantly outperforms prior methods like AIGIBench and OpenFake in generalizing to unseen generators and distribution shifts, despite using substantially less training data.
📝 Abstract
The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creating an urgent need for reliable deepfake detection. Yet most existing approaches rely on massive real--fake datasets, which are increasingly difficult to maintain as new generators continue to emerge. In this work, we investigate how much information about image authenticity is already encoded in modern multimodal vision representations. We find that frozen multimodal encoders naturally separate real and synthetic images in their embedding space, enabling a simple linear classifier to achieve strong performance without task-specific fine-tuning. Motivated by this observation, we develop a representation-aware data curation strategy that selects a compact set of representative generators for training. The resulting training set contains only 10K images, compared to 288K in AIGIBench and 4M in OpenFake, while improving robustness to unseen generators and distribution shifts. We additionally introduce RealWorldBench, a benchmark consisting of modern camera photographs, contemporary stock images, and outputs from recent commercial generators. Experiments across multiple benchmarks show that combining frozen multimodal representations with carefully curated training data provides a simple and effective approach to AI-generated image detection.