🤖 AI Summary
This study examines how well document images edited with GPT-Image-2 evade detection by humans and by state-of-the-art forensic methods, including the model’s own zero-shot self-detection. The authors introduce AIForge-Doc v2, a paired benchmark dataset of 3,066 forgeries, and use it to evaluate four lines of defence: human observers (two-alternative forced-choice trials), a general-purpose forensic model (TruFor), a document-specific detector (DocTamper), and GPT-Image-2 as a zero-shot self-judge. Human accuracy is at chance (0.501), TruFor and DocTamper reach AUCs of only 0.599 and 0.585, and the self-judge is weakest at 0.532, never exceeding 0.59 across the prompt strategies tested. On same-domain traditional tampering the two forensic detectors reach AUCs of 0.962 and 0.852, so the drop is specific to GPT-Image-2 inpainting rather than a domain mismatch. These findings expose a vulnerability in current detection frameworks; the authors release the dataset, pipeline, four-judge protocol, and calibration sets.
📝 Abstract
OpenAI's GPT-Image-2 has effectively erased the visual boundary between authentic and AI-edited document images: a single number on a receipt can be replaced in under a second for a few cents. We release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format, and benchmark four lines of defence: human inspectors (N=120, n=365 pair-votes via the public 2AFC site CanUSpotAI.com), TruFor (generic forensic), DocTamper (qcf-568, document-specific), and the same GPT-Image-2 model as a zero-shot self-judge. To avoid the trivial "image is mostly real" reading, the self-judge is asked whether any region was generated or edited by an AI image model. Human 2AFC accuracy is 0.501, indistinguishable from chance: even side-by-side, inspectors cannot tell GPT-Image-2 receipt forgeries from authentic counterparts. The three computational judges sit only modestly above chance (TruFor 0.599, DocTamper 0.585, self-judge 0.532). The self-judge fails consistently, not by chance: across five prompt strategies and four policies for handling ambiguous responses, AUC never rises above 0.59. To rule out the possibility that the two forensic detectors are broken on our source domain rather than blind to AI inpainting, we calibrate each on a same-domain traditional-tampering set built for its training distribution: TruFor reaches AUC 0.962 on cross-camera splicing of our dataset, and DocTamper reaches 0.852 on cross-document OCR-token splicing with two-pass JPEG re-encoding. Both retain near-published performance on traditional tampering; switching to GPT-Image-2 inpainting drops AUC by 0.27-0.36 (0.962->0.599 TruFor; 0.852->0.585 DocTamper), isolating a detection gap specific to GPT-Image-2 inpainting. We release the dataset, pipeline, four-judge protocol, and calibration sets.
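The headline numbers can be sanity-checked with a few lines of Python. The sketch below is illustrative and is not the authors' pipeline: `auc` is a plain pairwise (Mann-Whitney) estimate, `wald_interval` is a normal-approximation 95% interval, and the count of 183 correct votes is an assumption inferred from the reported accuracy of 0.501 over n=365, since the abstract does not state it.

```python
import math


def auc(scores_forged, scores_authentic):
    """Pairwise AUC: chance that a forged image outscores an authentic one.

    Ties count as half a win. This is a generic estimator, not the
    evaluation code used in the paper.
    """
    wins = 0.0
    for f in scores_forged:
        for a in scores_authentic:
            if f > a:
                wins += 1.0
            elif f == a:
                wins += 0.5
    return wins / (len(scores_forged) * len(scores_authentic))


def wald_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p - half_width, p + half_width


# AUCs reported in the abstract: (traditional tampering, GPT-Image-2 inpainting).
reported = {"TruFor": (0.962, 0.599), "DocTamper": (0.852, 0.585)}
drops = {name: round(trad - ai, 3) for name, (trad, ai) in reported.items()}

# Assumption: 183 of 365 pair-votes correct, which rounds to the reported 0.501.
low, high = wald_interval(183, 365)

# Toy detector scores (hypothetical values) to show how auc() behaves.
toy_auc = auc([0.9, 0.4], [0.5, 0.1])

if __name__ == "__main__":
    print("AUC drops:", drops)
    print(f"Human 2AFC 95% interval: [{low:.3f}, {high:.3f}]")
    print("Toy AUC:", toy_auc)
```

Under these assumptions the two drops come out near 0.36 and 0.27, matching the 0.27-0.36 range quoted above, and the interval for human accuracy comfortably contains 0.5, which is what "indistinguishable from chance" means in practice.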