🤖 AI Summary
Most existing misinformation detectors for social media rely on a single modality, text or images, and few model linguistic, visual, and social signals together. This paper studies an early-fusion classification approach that combines (i) tweet text; (ii) visual features derived from images through object detection and optical character recognition (OCR); and (iii) social features added through a data enrichment step. The classifier combines unsupervised and supervised machine learning models. On a dataset of 1,529 tweets containing both text and images, collected from Twitter (now X) during the COVID-19 pandemic and election periods, the approach improves classification performance by 15% over unimodal models and by 5% over bimodal models. The study also analyzes how misinformation propagates, based on the characteristics of misinformation tweets and of the users who spread them.
📝 Abstract
Amid a tidal wave of misinformation flooding social media during elections and crises, extensive research has been conducted on misinformation detection, primarily using text-based or image-based approaches. Only a few studies have explored multimodal feature combinations, such as integrating text and images, to build a classification model for detecting misinformation. This study investigates the effectiveness of different multimodal feature combinations that incorporate text, image, and social features, using an early fusion approach for the classification model. It analyzes 1,529 tweets containing both text and images, collected from Twitter (now X) during the COVID-19 pandemic and election periods. A data enrichment process was applied to extract additional social features, as well as visual features obtained through techniques such as object detection and optical character recognition (OCR). The results show that combining unsupervised and supervised machine learning models improves classification performance by 15% compared to unimodal models and by 5% compared to bimodal models. Additionally, the study analyzes the propagation patterns of misinformation based on the characteristics of misinformation tweets and the users who disseminate them.
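As a rough illustration of what early fusion means here, the sketch below concatenates per-tweet feature vectors from each modality into a single vector before any classifier sees them. It is a minimal sketch under stated assumptions, not the study's pipeline: the function names, the z-score scaling step, and the toy feature values are all hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each feature column so no modality dominates by raw scale."""
    columns = list(zip(*rows))
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def early_fusion(*modalities):
    """Concatenate the scaled feature vectors of every modality, tweet by tweet."""
    n_tweets = len(modalities[0])
    if any(len(m) != n_tweets for m in modalities):
        raise ValueError("every modality needs one feature vector per tweet")
    scaled = [standardize(m) for m in modalities]
    return [[v for m in scaled for v in m[i]] for i in range(n_tweets)]


# Hypothetical toy features for three tweets (values are made up).
text_feats = [[0.2, 0.9], [0.4, 0.1], [0.6, 0.5]]          # e.g. text scores
image_feats = [[3.0], [1.0], [2.0]]                        # e.g. detected-object count
social_feats = [[120.0, 1.0], [15.0, 0.0], [900.0, 1.0]]   # e.g. followers, verified flag

fused = early_fusion(text_feats, image_feats, social_feats)
```

Each row of `fused` is one tweet's combined vector (five values in this toy case) and can be passed to any downstream classifier. A late-fusion design would instead train one model per modality and combine their predictions.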