🤖 AI Summary
AIGC-generated images frequently exhibit semantic anomalies, such as implausible object configurations or violations of physical laws and commonsense knowledge, that undermine content credibility. To address this, the authors introduce the task of *semantic anomaly detection and reasoning* and propose AnomReason, the first large-scale, fine-grained benchmark for it, together with AnomAgent, a multi-agent reasoning framework capable of precise anomaly localization, root-cause attribution, and severity assessment. Annotations are produced by a modular multi-agent pipeline in which GPT-4o generates structured quadruples *(Name, Phenomenon, Reasoning, Severity)*, validated via lightweight human verification; semantic-aware evaluation metrics, SemAP and SemF1, quantify alignment with human judgment. Models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines in detection accuracy. The framework has been applied to explainable deepfake detection and to evaluating the semantic fidelity of generative models, advancing reproducible and interpretable research on AIGC semantic authenticity.
📝 Abstract
The rapid advancement of AI-generated content (AIGC) has enabled the synthesis of visually convincing images; however, many such outputs exhibit subtle **semantic anomalies**, including unrealistic object configurations, violations of physical laws, or commonsense inconsistencies, which compromise the overall plausibility of the generated scenes. Detecting these semantic-level anomalies
is essential for assessing the trustworthiness of AIGC media, especially in AIGC image analysis, explainable deepfake detection, and semantic authenticity assessment. In this paper,
we formalize **semantic anomaly detection and reasoning** for AIGC images and
introduce **AnomReason**, a large-scale benchmark with structured annotations as quadruples *(Name, Phenomenon, Reasoning, Severity)*. Annotations are produced by
a modular multi-agent pipeline (**AnomAgent**) with lightweight human-in-the-loop verification, enabling scale while preserving quality.
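The quadruple annotation above can be pictured as a simple record. The sketch below is illustrative only: the field names follow the abstract, but the Python types and the severity scale are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of one AnomReason annotation quadruple.
# Field types and the Severity scale are assumed for illustration.
@dataclass
class AnomalyAnnotation:
    name: str        # the anomalous object or region, e.g. "left hand"
    phenomenon: str  # what is visibly wrong, e.g. "six fingers"
    reasoning: str   # why it violates physics or commonsense
    severity: int    # assumed ordinal scale, e.g. 1 (minor) to 3 (severe)

# Example annotation for a common generation artifact.
ann = AnomalyAnnotation(
    name="left hand",
    phenomenon="six fingers",
    reasoning="human hands have five fingers",
    severity=2,
)
print(ann.name, ann.severity)
```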
At construction time, AnomAgent processed approximately 4.17B GPT-4o tokens, evidence of the scale behind the resulting structured annotations. We further
show that models fine-tuned on AnomReason achieve consistent gains over strong vision-language baselines under our proposed semantic matching metrics (*SemAP* and *SemF1*).
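The abstract does not define SemF1, so the sketch below is only a plausible shape for a semantic-matching F1: greedily match each predicted anomaly description to an unmatched reference whose similarity clears a threshold, then compute F1 over the matches. The token-overlap similarity and threshold here are stand-in assumptions; the actual metric likely uses learned semantic similarity.

```python
def sim(a: str, b: str) -> float:
    """Stand-in token-overlap (Jaccard) similarity between two
    anomaly descriptions; assumed, not the paper's definition."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def sem_f1(pred: list[str], gold: list[str], t: float = 0.6) -> float:
    """Greedy one-to-one matching of predictions to references at
    similarity threshold t, then F1 over the matched pairs."""
    if not pred or not gold:
        return 0.0
    unmatched = list(gold)
    tp = 0
    for p in pred:
        scored = [(sim(p, g), g) for g in unmatched]
        if scored:
            best, g = max(scored, key=lambda x: x[0])
            if best >= t:
                tp += 1
                unmatched.remove(g)  # each reference matches at most once
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Paraphrased descriptions of the same anomaly still match.
print(sem_f1(["six fingers on left hand"], ["left hand has six fingers"]))
```

A SemAP analogue would rank predictions by confidence and average precision over the same matching, but that too is an assumption about the metric's design.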
Applications to explainable deepfake detection and semantic reasonableness assessment of image generators demonstrate practical utility. In summary, AnomReason and AnomAgent
serve as a foundation for measuring and improving the semantic plausibility of AI-generated images. We will release code, metrics, data, and task-aligned models to support reproducible research on semantic authenticity and interpretable AIGC forensics.