KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to detect multimodal fake news that combines deceptive text, manipulated images, and factual inaccuracies, largely because they have limited ability to capture cross-modal semantic inconsistencies. This work proposes an end-to-end tri-modal detection framework that jointly models text (via RoBERTa), images (via CLIP), and external knowledge from Wikidata, the last encoded with a graph attention network. A multimodal Transformer fuses the three modalities, and a cross-modal attention mechanism helps the model identify contradictions between text and image as well as conflicts with factual knowledge. The authors report that the method significantly outperforms unimodal and bimodal baselines and that it provides interpretable modality-wise confidence scores.

📝 Abstract

Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on text-image fusion or applies external knowledge only as a post-processing step, which limits its ability to detect deeper semantic inconsistencies. In this paper, we introduce KITE (Knowledge-Integrated Text-Image Encoder), a tri-modal fake news detection framework that jointly models textual, visual, and factual knowledge representations. KITE uses RoBERTa [23,14] and CLIP [24] for linguistic and visual encoding, while a Graph Attention Network (GAT) processes structured facts retrieved from Wikidata. Cross-modal attention [9] within a multimodal transformer integrates the text, visual, and knowledge features, allowing the model to capture how the modalities relate to one another. Modality-specific confidence scores are generated alongside the final prediction, offering interpretability by indicating which input type most influenced the decision. Evaluations on benchmark datasets demonstrate that KITE significantly outperforms unimodal and bimodal baselines, particularly in scenarios involving image-text mismatches or contradictions with external knowledge.
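The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so everything below is an assumption made for illustration. Toy 2-dimensional vectors stand in for RoBERTa, CLIP, and GAT outputs, the attention uses identity projections instead of learned weights, and the `gate` vector and the softmax-based confidence scores are hypothetical choices for showing how per-modality scores could be produced alongside a fused representation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, context):
    # Scaled dot-product attention: each query vector attends over the
    # context vectors (identity projections; an illustrative simplification).
    d = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in context])
        out.append([sum(w * v[i] for w, v in zip(weights, context))
                    for i in range(d)])
    return out

def mean_pool(tokens):
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def fuse(text, image, knowledge, gate):
    # Each modality attends over the tokens of the other two, is pooled,
    # and is scored by a hypothetical gate vector. The softmax of those
    # scores plays the role of modality-wise confidence (an assumption).
    modalities = {"text": text, "image": image, "knowledge": knowledge}
    pooled = {}
    for name, tokens in modalities.items():
        context = [v for other, toks in modalities.items()
                   if other != name for v in toks]
        pooled[name] = mean_pool(cross_attend(tokens, context))
    names = list(pooled)
    confidences = softmax([dot(pooled[n], gate) for n in names])
    fused = [sum(c * pooled[n][i] for c, n in zip(confidences, names))
             for i in range(len(gate))]
    return fused, dict(zip(names, confidences))

# Toy inputs: two text tokens, one image patch, one knowledge-graph node.
text = [[1.0, 0.0], [0.5, 0.5]]
image = [[0.0, 1.0]]
knowledge = [[1.0, 1.0]]
gate = [0.3, -0.2]  # hypothetical scoring vector

fused, confidences = fuse(text, image, knowledge, gate)
print(fused, confidences)
```

In a real system the fused vector would feed a classification head, and the projections and gate would be learned end to end. The sketch only shows the data flow: attend across modalities, pool, score, and combine.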

Problem

Research questions and friction points this paper is trying to address.

fake news detection

multimodal misinformation

semantic inconsistency

knowledge integration

text-image mismatch

Innovation

Methods, ideas, or system contributions that make the work stand out.

tri-modal fusion

knowledge graph integration

cross-modal attention