AI Summary
To address the limited detection accuracy of fake news caused by isolated unimodal modeling of text and images, this paper proposes TT-BLIP, an end-to-end multimodal fake news detection model. Its key contributions are: (1) a novel tri-modal Tri-Transformer architecture integrating three parallel attention streams (text-to-image, image-to-text, and text-to-text) to jointly model cross-modal interactions; (2) tight coupling of the BLIP bidirectional vision-language encoder with three dedicated image adaptation modules, enabling fine-grained cross-modal alignment and joint representation learning; and (3) fully differentiable end-to-end optimization. Extensive experiments on Weibo and GossipCop demonstrate that TT-BLIP significantly outperforms existing state-of-the-art methods, validating the substantial performance gains enabled by deep, structured multimodal fusion for fake news detection.
Abstract
Detecting fake news has received a lot of attention. Many previous methods concatenate independently encoded unimodal data, ignoring the benefits of integrated multimodal information. The absence of specialized feature extraction for text and images further limits these methods. This paper introduces an end-to-end model called TT-BLIP that applies bootstrapping language-image pretraining for unified vision-language understanding and generation (BLIP) to three types of information: text, images, and, through bidirectional BLIP encoders, multimodal information. The Multimodal Tri-Transformer fuses the tri-modal features using three types of multi-head attention mechanisms, integrating the modalities into enhanced representations for improved multimodal data analysis. Experiments are performed on two fake news datasets, Weibo and GossipCop. The results indicate that TT-BLIP outperforms the state-of-the-art models.
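To make the fusion idea concrete, the following is a minimal sketch of combining three feature sets with attention. It is not the authors' implementation. It uses single-head scaled dot-product attention in plain Python instead of learned multi-head attention, and the names `attention`, `mean_pool`, and `tri_fuse`, as well as the choice of text tokens as the queries in every stream, are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output row per query.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def mean_pool(rows):
    """Average a list of vectors into a single vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def tri_fuse(text, image, multimodal):
    """Hypothetical fusion: three attention streams, pooled and concatenated.

    Assumption: text tokens are the queries in all three streams.
    """
    streams = [
        attention(text, text, text),              # text attends to text
        attention(text, image, image),            # text attends to image
        attention(text, multimodal, multimodal),  # text attends to joint features
    ]
    fused = []
    for stream in streams:
        fused.extend(mean_pool(stream))
    return fused


# Toy features: 2 text tokens, 1 image patch, 2 joint tokens, all of width 2.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.5, 0.5]]
multimodal = [[1.0, 1.0], [0.0, 0.0]]
fused = tri_fuse(text, image, multimodal)
```

In a trained model each stream would have its own learned projections and multiple heads, and the fused vector would be passed to a classifier that predicts whether the news item is real or fake.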