🤖 AI Summary
This work addresses the challenge of detecting video misinformation that builds deceptive narratives through selective editing, cross-source splicing, or AI generation. Such manipulations cannot be reliably identified from a single video alone. To this end, the paper introduces EVID-Bench, a benchmark for video misinformation detection that requires open-web retrieval and cross-video verification. It comprises 222 videos across three categories (AI generation, single-source editing, and multi-source editing) and nine manipulation types, all verified to be undetectable by frontier models through visual inspection alone. Evaluating nine frontier multimodal models with a retrieval-augmented verification baseline, the best system achieves only 61.43% point-level and 43.24% video-level accuracy, with AI-generated manipulations remaining especially challenging. Recurring failure modes include fixating on irrelevant anchors, misattributing synthetic content to editorial splicing, and terminating search before the manipulation is fully explained.
📝 Abstract
Video misinformation increasingly operates at the semantic and evidential level: authentic footage may be selectively edited, temporally reordered, spliced across sources, or augmented with AI-generated content to construct false narratives. Such evidence-dependent manipulations cannot be reliably verified from the input video alone, because the missing, reordered, replaced, or recontextualized evidence lies outside the video itself. We introduce EVID-Bench, a benchmark for search-grounded video misinformation detection, where a system must search the open web for related videos and identify what information is false through cross-video comparison. EVID-Bench comprises 222 videos spanning 9 manipulation types across 3 categories: AI generation, single-source editing, and multi-source editing. All samples are verified to be undetectable by frontier models through visual inspection alone. We evaluate nine frontier multimodal models using a retrieval-augmented verification baseline. The best system achieves only 61.43% point-level accuracy and 43.24% video-level accuracy, while AI-generated manipulations remain especially challenging. Error analysis reveals recurring challenges: models fixate on irrelevant anchors, misattribute synthetic content to editorial splicing, and terminate search prematurely before fully explaining the manipulation.
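The abstract does not define how point-level and video-level accuracy are computed. The sketch below shows one plausible reading, offered only as an assumption: each video carries several annotated manipulation points, point-level accuracy is the fraction of points judged correctly, and a video counts as correct only when all of its points are correct. The function name, the input format, and the toy data are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple


def point_and_video_accuracy(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Compute (point-level, video-level) accuracy.

    `results` maps a video id to per-point correctness flags.
    This input format is a hypothetical illustration, not the benchmark's.
    """
    total_points = sum(len(flags) for flags in results.values())
    correct_points = sum(sum(flags) for flags in results.values())
    # Assumed rule: a video is correct only if every one of its points is correct.
    correct_videos = sum(1 for flags in results.values() if flags and all(flags))
    return correct_points / total_points, correct_videos / len(results)


# Toy data, invented purely for illustration.
toy = {
    "v1": [True, True],
    "v2": [True, False, True],
    "v3": [False],
}
point_acc, video_acc = point_and_video_accuracy(toy)
print(f"point-level: {point_acc:.2%}, video-level: {video_acc:.2%}")
```

Under this assumed all-or-nothing rule, video-level accuracy can never exceed the share of fully solved videos, which is consistent with the reported gap between 61.43% and 43.24%, although the paper's actual metric definitions may differ.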