When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of video misinformation that builds deceptive narratives through selective editing, cross-source splicing, or AI generation. Such manipulations cannot be reliably identified from a single video alone, because the evidence needed to expose them lies outside the video. The paper introduces EVID-Bench, a benchmark for video misinformation detection that requires open-web retrieval and cross-video verification. It comprises 222 videos covering nine manipulation types in three categories (AI generation, single-source editing, and multi-source editing), all verified to be undetectable by frontier models through visual inspection alone. Nine frontier multimodal models are evaluated with a retrieval-augmented verification baseline; the best system reaches only 61.43% point-level and 43.24% video-level accuracy, and AI-generated manipulations remain especially challenging. Recurring failure modes include fixating on irrelevant anchors, misattributing synthetic content to editorial splicing, and terminating search prematurely.

📝 Abstract

Video misinformation increasingly operates at the semantic and evidential level: authentic footage may be selectively edited, temporally reordered, spliced across sources, or augmented with AI-generated content to construct false narratives. Such evidence-dependent manipulations cannot be reliably verified from the input video alone, because the missing, reordered, replaced, or recontextualized evidence lies outside the video itself. We introduce EVID-Bench, a benchmark for search-grounded video misinformation detection, where a system must search the open web for related videos and identify what information is false through cross-video comparison. EVID-Bench comprises 222 videos spanning 9 manipulation types across 3 categories: AI generation, single-source editing, and multi-source editing. All samples are verified to be undetectable by frontier models through visual inspection alone. We evaluate nine frontier multimodal models using a retrieval-augmented verification baseline. The best system achieves only 61.43% point-level accuracy and 43.24% video-level accuracy, while AI-generated manipulations remain especially challenging. Error analysis reveals recurring challenges: models fixate on irrelevant anchors, misattribute synthetic content to editorial splicing, and terminate search prematurely before fully explaining the manipulation.
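
The abstract reports both point-level and video-level accuracy but does not define them here. The sketch below illustrates one plausible reading, stated as an assumption rather than the paper's definition: each video carries several annotated points, point-level accuracy pools all points, and a video counts as correct only if every one of its points is judged correctly. Under that assumption the video-level score can never exceed what the per-point results allow, which is consistent with the gap between the two reported numbers. The function names and the toy data are hypothetical.

```python
from typing import Dict, List


def point_level_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of individual points judged correctly, pooled over all videos."""
    flat = [ok for points in results.values() for ok in points]
    return sum(flat) / len(flat) if flat else 0.0


def video_level_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of videos whose points are ALL judged correctly.

    Assumption: a video with no annotated points counts as incorrect.
    """
    if not results:
        return 0.0
    correct = sum(bool(points) and all(points) for points in results.values())
    return correct / len(results)


# Toy per-point outcomes (made up for illustration, not benchmark data).
toy = {
    "video_a": [True, True],
    "video_b": [True, False, True],
    "video_c": [False],
}

print(f"point-level: {point_level_accuracy(toy):.4f}")
print(f"video-level: {video_level_accuracy(toy):.4f}")
```

In the toy data, four of six points are correct but only one of three videos is fully correct, so a single missed point is enough to fail a whole video under the assumed rule.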

Problem

Research questions and friction points this paper is trying to address.

video misinformation

evidence-dependent manipulation

search-grounded detection

AI-generated content

cross-video verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

search-grounded verification

video misinformation detection

evidence-dependent manipulation