Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations

📅 2025-08-07
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Traditional text-based fact-checking systems struggle to effectively verify multimodal misinformation involving both text and images. To address this, we propose MultiCheck, a fine-grained multimodal fact-checking framework that jointly models textual, visual, and contextual representations. MultiCheck explicitly captures cross-modal semantic alignments through an element-level cross-modal interaction mechanism and a contrastive learning objective, thereby enhancing both interpretability and generalization. The architecture comprises dedicated text and vision encoders, a cross-modal fusion module, and a classification head, with contrastive learning employed to optimize semantic alignment across modalities. Evaluated on the Factify-2 benchmark, MultiCheck achieves a weighted F1 score of 0.84, significantly outperforming existing baselines. This work provides an interpretable and robust solution for multimodal fact-checking, advancing the state of the art in verifiable multimodal reasoning.

๐Ÿ“ Abstract
The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we propose MultiCheck, a unified framework for fine-grained multimodal fact verification designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.
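The fusion-then-classify pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, the specific element-wise interactions (product and absolute difference, a common choice for fusion modules), and the linear head are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 8, 3  # embedding dimension and number of veracity classes (assumed)

def fuse(t, v):
    """Element-wise cross-modal fusion: concatenate the text and image
    embeddings with their element-wise product and absolute difference."""
    return np.concatenate([t, v, t * v, np.abs(t - v)])

def classify(fused, W, b):
    """Linear classification head with a softmax over veracity labels."""
    logits = W @ fused + b
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

t = rng.standard_normal(D)   # stand-in for a text-encoder output
v = rng.standard_normal(D)   # stand-in for a vision-encoder output
W = rng.standard_normal((C, 4 * D))
b = np.zeros(C)

probs = classify(fuse(t, v), W, b)  # probability over veracity classes
```

In practice the encoder outputs would come from pretrained text and vision models, and the head would be trained jointly with the contrastive objective.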
Problem

Research questions and friction points this paper is trying to address.

Detecting multimodal misinformation combining text and images
Improving fact-checking accuracy with unified visual-textual analysis
Enhancing cross-modal reasoning for scalable fact verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for multimodal fact verification
Fusion module captures cross-modal relationships
Contrastive learning aligns claim-evidence pairs
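The contrastive alignment of claim-evidence pairs mentioned above can be illustrated with an InfoNCE-style loss, a standard formulation for this kind of objective; the paper's exact loss and temperature may differ, and all names here are illustrative.

```python
import numpy as np

def info_nce(claims, evidence, tau=0.07):
    """InfoNCE-style contrastive loss: matched claim-evidence pairs share a
    row index and are pulled together; all other pairs act as negatives."""
    c = claims / np.linalg.norm(claims, axis=1, keepdims=True)
    e = evidence / np.linalg.norm(evidence, axis=1, keepdims=True)
    sim = c @ e.T / tau                               # cosine similarities as logits
    sim = sim - sim.max(axis=1, keepdims=True)        # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()               # diagonal = positive pairs

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))
loss_aligned = info_nce(x, x)                         # perfectly matched pairs
loss_random = info_nce(x, rng.standard_normal((4, 8)))
```

Matched pairs yield a much lower loss than random pairs, which is the signal that drives claim and evidence embeddings toward a shared latent space.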
Aditya Kishore
Department of Data Science and Engineering, Indian Institute of Science Education and Research, Bhopal, India
Gaurav Kumar
Department of Data Science and Engineering, Indian Institute of Science Education and Research, Bhopal, India
Jasabanta Patro
Assistant Professor, DSE, IISER Bhopal
NLP · Social Computing