🤖 AI Summary
This work addresses the detection of AI-generated images. It proposes a detection method based on fine-tuning a pre-trained Vision Transformer (ViT), designed to generalize across diverse generative models. Working on the multi-source Defactify-4.0 synthetic image dataset, the approach applies a set of heterogeneous perturbations as data augmentation during training: geometric transformations (flipping and rotation), Gaussian noise injection, and JPEG compression. The fine-tuned model is reported to achieve state-of-the-art performance on both the validation and test sets, outperforming competing methods. The results indicate that combining a ViT backbone with composite data augmentation improves robustness and cross-generator generalization, including on outputs from Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and MidJourney.
📝 Abstract
The aim of this work is to explore the potential of pre-trained vision models, specifically Vision Transformers (ViT), enhanced with advanced data augmentation strategies for the detection of AI-generated images. Our approach fine-tunes a ViT model on the Defactify-4.0 dataset, which includes images generated by state-of-the-art models such as Stable Diffusion 2.1, Stable Diffusion XL, Stable Diffusion 3, DALL-E 3, and MidJourney. During training we apply perturbations such as flipping, rotation, Gaussian noise injection, and JPEG compression to improve model robustness and generalisation. The experimental results demonstrate that our ViT-based pipeline achieves state-of-the-art performance, significantly outperforming competing methods on both the validation and test sets.
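The four perturbations named in the abstract can be sketched as follows. This is a minimal illustration using Pillow and NumPy, not the authors' implementation: the function names (`add_gaussian_noise`, `jpeg_compress`, `perturb`), the 0.5 application probability, the ±15° rotation range, the noise standard deviation range, and the JPEG quality range are all assumptions chosen for the example, since the abstract does not state them.

```python
import io

import numpy as np
from PIL import Image, ImageOps


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise (std `sigma`, on the 0-255 scale) to an RGB image."""
    arr = np.asarray(img, dtype=np.float32)
    noise = rng.normal(0.0, sigma, size=arr.shape)
    noisy = np.clip(arr + noise, 0, 255).astype(np.uint8)
    return Image.fromarray(noisy)


def jpeg_compress(img, quality):
    """Round-trip the image through an in-memory JPEG encode/decode."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def perturb(img, rng, p=0.5):
    """Apply each perturbation independently with probability `p`.

    All ranges below are illustrative assumptions, not values from the paper.
    """
    img = img.convert("RGB")
    if rng.random() < p:
        img = ImageOps.mirror(img)  # horizontal flip
    if rng.random() < p:
        img = img.rotate(float(rng.uniform(-15.0, 15.0)))  # keeps original size
    if rng.random() < p:
        img = add_gaussian_noise(img, float(rng.uniform(1.0, 10.0)), rng)
    if rng.random() < p:
        img = jpeg_compress(img, int(rng.integers(50, 96)))
    return img
```

In a training loop, a function like `perturb` would typically be called on each image before resizing and normalisation, with a generator created once via `np.random.default_rng(seed)` so that runs are reproducible.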