🤖 AI Summary
To address the lack of interpretability and the inability to localize misleading segments across modalities in existing misinformation video detection, this paper introduces the GroundMM task: verifying multimodal content and jointly localizing misleading segments across the textual, audio, and visual modalities. To support this task, the authors present GroundLie360, the first real-world dataset for this task, featuring fine-grained spatiotemporal annotations, a taxonomy of misinformation types, and verification grounded in fact-checking evidence and annotator reasoning. They further propose FakeMark, a question-answering–driven vision-language model baseline that integrates single- and cross-modal cues for interpretable detection and precise localization. Experiments demonstrate the task's substantial difficulty. Together, GroundLie360 and FakeMark constitute the first benchmark dedicated to explainable multimodal misinformation grounding, advancing evaluation paradigms and methodological development in this emerging field.
📝 Abstract
The proliferation of online misinformation videos poses serious societal risks. Current datasets and detection methods primarily target binary classification or single-modality localization based on post-processed data, lacking the interpretability needed to counter persuasive misinformation. In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies multimodal content and localizes misleading segments across modalities. We present the first real-world dataset for this task, GroundLie360, featuring a taxonomy of misinformation types, fine-grained annotations across text, speech, and visuals, and validation with Snopes evidence and annotator reasoning. We also propose a VLM-based, QA-driven baseline, FakeMark, using single- and cross-modal cues for effective detection and grounding. Our experiments highlight the challenges of this task and lay a foundation for explainable multimodal misinformation detection.