🤖 AI Summary
This work addresses the emerging threat of highly realistic, fully synthetic videos generated by open-source text-to-video models, a challenge that existing deepfake-detection benchmarks, focused primarily on facial manipulation, cover inadequately. To bridge this gap, the authors introduce the first human-centric detection benchmark dedicated to purely synthetic videos, comprising 6,815 high-quality samples from five leading open-source text-to-video generators. A rigorous two-stage human verification process ensures both semantic coherence and visual fidelity, and each video is released at four compression levels, together with full generation metadata, to support robustness evaluation under real-world conditions. Experiments reveal that current detectors suffer an average AUC drop of 29.19% on this benchmark, with some performing below random chance, whereas models trained on it achieve 93.81% AUC against unseen generators, albeit with markedly reduced generalization to traditional face-manipulation deepfakes.
📝 Abstract
The landscape of synthetic media has been irrevocably altered by text-to-video (T2V) models, whose outputs are rapidly approaching indistinguishability from reality. Critically, this technology is no longer confined to large-scale labs: the proliferation of efficient, open-source generators is democratizing the creation of high-fidelity synthetic content on consumer-grade hardware, rendering existing face-centric, manipulation-based benchmarks obsolete. To address this urgent threat, we introduce SynthForensics, to the best of our knowledge the first human-centric benchmark for detecting purely synthetic video deepfakes. The benchmark comprises 6,815 unique videos from five architecturally distinct, state-of-the-art open-source T2V models, and its construction was underpinned by a meticulous two-stage, human-in-the-loop validation to ensure high semantic and visual quality. Each video is provided in four versions (raw, lossless, light, and heavy compression) to enable real-world robustness testing. Experiments demonstrate that state-of-the-art detectors are fragile and generalize poorly on this new domain: we observe a mean drop of $29.19\%$ AUC, with some methods performing worse than random chance and top models losing over 30 AUC points under heavy compression. We further investigate training on SynthForensics as a means of closing these gaps, achieving robust generalization to unseen generators ($93.81\%$ AUC), though at the cost of reduced performance on traditional manipulation-based deepfakes. The complete dataset and all generation metadata, including the specific prompts and inference parameters for every video, will be made publicly available at [link anonymized for review].
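To make the compression-robustness protocol concrete, the sketch below shows one plausible way to produce the four per-video versions and measure a detector's AUC at each level. It is illustrative only: the CRF values, the directory layout, and the `score_video` detector stub are our own assumptions, not details confirmed by the paper.

```python
"""Minimal sketch of a compression-robustness evaluation in the spirit of the
abstract: re-encode each clip at several H.264 quality levels and compare the
detector's AUC across levels. CRF choices and the detector stub are
hypothetical placeholders, not values taken from SynthForensics."""
import subprocess
from pathlib import Path

from sklearn.metrics import roc_auc_score

# Hypothetical settings standing in for the paper's four versions:
# raw (no re-encode), lossless (CRF 0), light, and heavy compression.
COMPRESSION_LEVELS = {"raw": None, "lossless": 0, "light": 23, "heavy": 40}


def compress(src: Path, dst: Path, crf: int) -> None:
    """Re-encode `src` with libx264 at the given constant rate factor."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-c:v", "libx264", "-crf", str(crf), "-an", str(dst)],
        check=True,
    )


def score_video(path: Path) -> float:
    """Stub for an arbitrary deepfake detector returning P(synthetic)."""
    raise NotImplementedError("plug in a real detector here")


def auc_per_level(videos: list[tuple[Path, int]], out_dir: Path) -> dict[str, float]:
    """AUC at each compression level; `videos` pairs each raw clip with its
    ground-truth label (1 = synthetic, 0 = real)."""
    results: dict[str, float] = {}
    for level, crf in COMPRESSION_LEVELS.items():
        labels, scores = [], []
        for src, label in videos:
            if crf is None:  # "raw": score the original file directly
                dst = src
            else:
                dst = out_dir / level / src.name
                dst.parent.mkdir(parents=True, exist_ok=True)
                compress(src, dst, crf)
            labels.append(label)
            scores.append(score_video(dst))
        results[level] = roc_auc_score(labels, scores)
    return results
```

Under this setup, a degradation like the reported 30-point loss under heavy compression would appear as the gap between `results["raw"]` and `results["heavy"]` for a given detector.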