A New Dataset and Benchmark for Grounding Multimodal Misinformation

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of interpretability and of multimodal misleading-segment localization in the detection of online misinformation videos, this paper introduces the GroundMM task: joint localization of misleading segments across the textual, audio, and visual modalities. To support this task, we present GroundLie360—the first real-world multimodal misinformation dataset, featuring fine-grained spatiotemporal annotations, a taxonomy of misinformation types, and a verification mechanism grounded in fact-checking evidence. We further propose FakeMark, a question-answering-driven vision-language model baseline that integrates unimodal and cross-modal cues to achieve interpretable detection and precise localization. Experiments demonstrate the task's substantial difficulty. Together, GroundLie360 and FakeMark constitute the first benchmark framework dedicated to explainable multimodal misinformation localization, advancing evaluation paradigms and methodological development in this emerging field.

📝 Abstract
The proliferation of online misinformation videos poses serious societal risks. Current datasets and detection methods primarily target binary classification or single-modality localization based on post-processed data, lacking the interpretability needed to counter persuasive misinformation. In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies multimodal content and localizes misleading segments across modalities. We present the first real-world dataset for this task, GroundLie360, featuring a taxonomy of misinformation types, fine-grained annotations across text, speech, and visuals, and validation with Snopes evidence and annotator reasoning. We also propose a VLM-based, QA-driven baseline, FakeMark, using single- and cross-modal cues for effective detection and grounding. Our experiments highlight the challenges of this task and lay a foundation for explainable multimodal misinformation detection.
Problem

Research questions and friction points this paper is trying to address.

Detecting and localizing misleading segments in multimodal misinformation videos
Addressing lack of interpretability in current misinformation detection methods
Verifying multimodal content across text, speech, and visual modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-based, QA-driven baseline FakeMark
Single- and cross-modal cues for detection and grounding
Real-world dataset GroundLie360 with fine-grained annotations
Bingjian Yang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Danni Xu
National University of Singapore
Kaipeng Niu
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Wenxuan Liu
School of Computer Science, Peking University; State Key Laboratory for Multimedia Information Processing, Peking University
Zheng Wang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Mohan Kankanhalli
School of Computing, National University of Singapore