🤖 AI Summary
AI-generated images—particularly deepfakes—pose severe threats to multimedia forensics, misinformation detection, and biometric authentication, exacerbating fraud and social engineering risks. Existing detection methods suffer from three key limitations: (i) non-standardized benchmark datasets, (ii) inconsistent training protocols (e.g., end-to-end training, feature freezing, or fine-tuning applied indiscriminately), and (iii) narrow evaluation metrics lacking generalization assessment and interpretability analysis. To address these issues, we propose the first systematic, reproducible benchmark framework for evaluating AI-generated image detection. It uniformly assesses ten state-of-the-art methods across seven diverse datasets spanning GAN- and diffusion-based generators. Our framework introduces multi-dimensional quantitative metrics—including ROC-AUC, class-wise sensitivity, and error rates—alongside interpretability analyses via Grad-CAM and confidence calibration curves. Crucially, we empirically reveal a significant performance gap between in-distribution accuracy and cross-model generalization capability—a previously uncharacterized limitation. This work establishes an empirical foundation and principled methodology for developing robust, interpretable detection systems.
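To make the evaluation protocol concrete, here is a minimal sketch of how the reported quantitative metrics could be computed with scikit-learn for a binary real-vs-generated detector. The function and variable names are illustrative assumptions, not the paper's actual code.

```python
# Illustrative sketch (not the paper's code): multi-metric evaluation of a
# binary real-vs-generated detector, mirroring the metrics named above.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    recall_score,
    roc_auc_score,
)

def evaluate_detector(y_true: np.ndarray, y_score: np.ndarray,
                      threshold: float = 0.5) -> dict:
    """y_true: 0 = real, 1 = AI-generated; y_score: confidence for 'generated'."""
    y_pred = (y_score >= threshold).astype(int)
    acc = accuracy_score(y_true, y_pred)
    return {
        "accuracy": acc,
        "error_rate": 1.0 - acc,
        # Threshold-free ranking metrics.
        "roc_auc": roc_auc_score(y_true, y_score),
        "average_precision": average_precision_score(y_true, y_score),
        # Class-wise sensitivity: recall computed per class.
        "sensitivity_generated": recall_score(y_true, y_pred, pos_label=1),
        "sensitivity_real": recall_score(y_true, y_pred, pos_label=0),
    }
```

Reporting threshold-free metrics (ROC-AUC, average precision) alongside thresholded ones (accuracy, error rate, per-class sensitivity) is what lets the benchmark separate a detector's ranking quality from its calibration at a fixed operating point.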
📝 Abstract
The threats posed by AI-generated media, particularly deepfakes, now present significant challenges for multimedia forensics, misinformation detection, and biometric systems, eroding public trust in the legal system and driving a sharp rise in fraud and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks built from GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., training from scratch, frozen features, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for the systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods under three training regimes (from scratch, frozen, and fine-tuned) on seven publicly available datasets (GAN- and diffusion-generated) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
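For the interpretability analysis mentioned above, a condensed Grad-CAM sketch in PyTorch is shown below. It assumes a CNN-based detector exposed as a `torch.nn.Module` with an identifiable last convolutional layer; the names `model`, `target_layer`, and `grad_cam` are hypothetical and stand in for whatever the benchmarked methods actually use.

```python
# Hypothetical Grad-CAM sketch (not the authors' implementation), assuming a
# CNN detector `model` and its last conv layer `target_layer`.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx=1):
    """Return a [0, 1] heatmap of regions driving the 'generated' logit."""
    activations, gradients = {}, {}
    h_fwd = target_layer.register_forward_hook(
        lambda mod, inp, out: activations.update(a=out))
    h_bwd = target_layer.register_full_backward_hook(
        lambda mod, grad_in, grad_out: gradients.update(g=grad_out[0]))

    logits = model(image.unsqueeze(0))   # (1, num_classes)
    model.zero_grad()
    logits[0, class_idx].backward()      # gradient w.r.t. chosen class logit

    h_fwd.remove()
    h_bwd.remove()

    # Weight each feature channel by its spatially averaged gradient,
    # then keep only positive contributions and normalize.
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)    # (1, C, 1, 1)
    cam = F.relu((weights * activations["a"]).sum(dim=1))      # (1, H, W)
    cam = cam / (cam.max() + 1e-8)
    return cam.squeeze(0)
```

Upsampled to the input resolution and overlaid on the image, such heatmaps indicate whether a detector attends to plausible generation artifacts or to spurious background cues, which is the kind of qualitative evidence the benchmark pairs with its confidence curves.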