🤖 AI Summary
To address the growing credibility crisis in video forensics caused by the proliferation of AI-generated videos, this work moves beyond conventional black-box binary classification. Methodologically, we introduce DAVID-XR1, an interpretable video-language model that jointly performs fine-grained spatiotemporal artifact localization, multi-step visual reasoning, and natural language explanation generation, reframing the question from “Is this video fake?” to “Why and where is it fake?”. We construct DAVID-X, the first benchmark dataset featuring spatiotemporal, artifact-level annotations and human-annotated natural language rationales. Key technical innovations include chain-of-thought distillation and joint artifact classification-localization training. Experiments demonstrate state-of-the-art performance in cross-generator and cross-generation-mode settings: fine-tuning only a general-purpose vision backbone yields a 12.7% absolute accuracy gain in forgery detection, while explanation faithfulness reaches 89.4%.
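The summary names two training ingredients: chain-of-thought distillation and joint artifact classification-localization. A minimal PyTorch sketch of how such a combined objective could be wired up is shown below; the head shapes, loss choices (L1 regression for localization, token-level cross-entropy for rationale distillation), and loss weights are our own illustrative assumptions, not the paper's implementation.

```python
# Sketch of a joint objective: artifact classification + temporal-spatial
# localization + chain-of-thought (CoT) distillation. All module names,
# head shapes, and weights are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDetectionHead(nn.Module):
    def __init__(self, hidden_dim: int, num_artifact_types: int):
        super().__init__()
        self.cls_head = nn.Linear(hidden_dim, num_artifact_types)  # artifact category
        self.span_head = nn.Linear(hidden_dim, 2)  # normalized (start, end) time
        self.box_head = nn.Linear(hidden_dim, 4)   # normalized (x, y, w, h) region

    def forward(self, video_feat: torch.Tensor):
        return (
            self.cls_head(video_feat),
            self.span_head(video_feat).sigmoid(),
            self.box_head(video_feat).sigmoid(),
        )

def joint_loss(head_out, labels, spans, boxes,
               student_logits, teacher_token_ids,
               w_cls=1.0, w_loc=1.0, w_cot=1.0):
    """Weighted sum of classification, localization, and CoT terms."""
    cls_logits, pred_spans, pred_boxes = head_out
    loss_cls = F.cross_entropy(cls_logits, labels)
    # L1 regression on temporal spans and spatial boxes (a common choice;
    # IoU-based losses would be a plausible alternative).
    loss_loc = F.l1_loss(pred_spans, spans) + F.l1_loss(pred_boxes, boxes)
    # CoT distillation: next-token cross-entropy against teacher-written
    # rationale tokens (sequence-level distillation).
    loss_cot = F.cross_entropy(
        student_logits.flatten(0, 1), teacher_token_ids.flatten()
    )
    return w_cls * loss_cls + w_loc * loss_loc + w_cot * loss_cot

# Example with dummy tensors (batch of 2, vocab of 100, 8 rationale tokens):
head = JointDetectionHead(hidden_dim=512, num_artifact_types=10)
out = head(torch.randn(2, 512))
loss = joint_loss(
    out,
    labels=torch.tensor([3, 7]),
    spans=torch.rand(2, 2),
    boxes=torch.rand(2, 4),
    student_logits=torch.randn(2, 8, 100),
    teacher_token_ids=torch.randint(0, 100, (2, 8)),
)
loss.backward()
```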
📝 Abstract
As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge extends beyond simply detecting subtle artifacts; it requires providing fine-grained, persuasive evidence that can convince auditors and end-users alike. To address this critical gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning, including defect categorization, temporal-spatial localization, and natural language explanations. This approach fundamentally transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process. We demonstrate that a general-purpose backbone, fine-tuned on our compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
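To make the dataset description concrete, here is one hypothetical shape a DAVID-X annotation record could take, pairing a defect category with its temporal window, spatial region, and written rationale. The field names and example values are our assumptions for illustration, not the released schema.

```python
# Illustrative annotation record for one detected defect; all field names
# are assumptions based on the abstract, not the actual DAVID-X schema.
from dataclasses import dataclass

@dataclass
class DefectAnnotation:
    defect_type: str          # e.g. "texture flicker", "implausible motion"
    start_sec: float          # temporal localization: window start
    end_sec: float            # temporal localization: window end
    bbox: tuple[float, float, float, float]  # spatial region, normalized xywh
    rationale: str            # written explanation of why this is a defect

example = DefectAnnotation(
    defect_type="hand deformation",
    start_sec=2.4,
    end_sec=3.1,
    bbox=(0.55, 0.62, 0.12, 0.10),
    rationale="Fingers merge and re-separate between frames, which is "
              "physically implausible for a real hand.",
)
```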