🤖 AI Summary
Existing fake news detection methods struggle to jointly model complex interactions among textual misinformation, manipulated images, and external knowledge—particularly overlooking fine-grained visual details and lacking entity-level knowledge-guided semantic reasoning. To address this, we propose a knowledge-guided multimodal reasoning framework: (1) bottom-up attention extracts fine-grained visual objects, followed by cross-modal encoding via CLIP and RoBERTa; (2) an entity-level explicit selection mechanism coupled with natural language inference (NLI)-driven knowledge graph filtering enables semantic anchoring and verification; and (3) a Transformer-based classifier performs joint semantic–structural modeling. Our method achieves significant improvements over state-of-the-art approaches across multiple benchmark datasets, demonstrating the effectiveness of adaptive neighborhood knowledge selection and multimodal semantic alignment. The source code is publicly available.
📝 Abstract
Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of the neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly available at [github.com/latuanvinh1998/KGAlign](https://github.com/latuanvinh1998/KGAlign).
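To make the "adaptively selecting relevant entities" step concrete, here is a minimal, dependency-free sketch of one plausible neighbor-selection scheme: rank candidate knowledge-graph entities by cosine similarity to the claim embedding and keep the top-k as the selected neighborhood. This is an illustration only; the entity names and embeddings below are invented, and the actual KGAlign mechanism operates on learned embeddings with NLI-guided filtering rather than raw cosine ranking.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_neighbors(claim_emb, entity_embs, k=2):
    """Rank candidate KG entities by similarity to the claim embedding
    and keep the top-k as the selected knowledge neighborhood."""
    ranked = sorted(entity_embs.items(),
                    key=lambda kv: cosine(claim_emb, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy 3-d embeddings (hypothetical; real models use high-dimensional vectors).
claim = [0.9, 0.1, 0.0]
entities = {
    "Entity_A": [0.8, 0.2, 0.1],
    "Entity_B": [0.0, 1.0, 0.0],
    "Entity_C": [0.7, 0.0, 0.3],
}
print(select_neighbors(claim, entities, k=2))  # → ['Entity_A', 'Entity_C']
```

In the full framework, the selected entities would then be encoded alongside the CLIP and RoBERTa features and fed to the Transformer classifier; an NLI model could additionally discard neighbors that contradict the claim before fusion.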