AI Summary
To address the lack of interpretable and verifiable natural-language explanations in multimodal fake news video detection, this paper introduces the novel Fake News Video Explanation (FNVE) task and presents FakeNVE, the first multimodal fake news video dataset annotated with human-written explanatory rationales. Methodologically, we propose a multimodal Transformer-based cross-modal alignment encoder that fuses visual frames, audio, and subtitle text features, coupled with a BART autoregressive decoder to generate attributional English explanations. Experiments demonstrate that our approach significantly outperforms baselines across BLEU, ROUGE, and BERTScore metrics, as well as human evaluation (92.3% sufficiency, 94.1% fluency), achieving a favorable balance between explanation readability and factual consistency. This work establishes a new explainable paradigm for multimodal fake news detection.
Abstract
Multimodal explanation involves assessing the veracity of diverse content and relies on multiple information modalities to jointly consider the relevance and consistency between them. Most existing fake news video detection methods focus on improving accuracy while neglecting the importance of providing explanations. In this paper, we propose a novel problem, Fake News Video Explanation (FNVE): given a multimodal news post containing both a video and caption text, we aim to generate natural language explanations that reveal the basis of the veracity prediction. To this end, we develop FakeNVE, a new dataset of explanations for the veracity of multimodal posts, where each explanation is a natural language (English) sentence describing the attribution of a news thread. We benchmark FakeNVE using a multimodal transformer-based architecture, followed by a BART-based autoregressive decoder as the generator. Experiments with various baselines applicable to FNVE show compelling results across multiple evaluation metrics. We also perform human evaluation of the generated explanations, which achieve high scores for both adequacy and fluency.
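The abstract does not say how the automatic metrics are computed; the summary lists BLEU, ROUGE, and BERTScore. As a rough illustration of what an overlap metric measures when a generated explanation is compared with a human-written one, the sketch below computes an F1-style ROUGE-L score from the longest common subsequence of tokens. It is a minimal sketch, not the paper's evaluation code: the whitespace tokenization, the balanced F1 weighting, the function names `lcs_length` and `rouge_l_f1`, and the two example sentences are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """F1-style ROUGE-L on lowercased, whitespace-split tokens (an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of generated tokens that match in order
    recall = lcs / len(ref)      # share of reference tokens that are recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical explanation pair, not taken from FakeNVE.
reference = "the video reuses old footage of a different event"
candidate = "the video shows old footage of another event"
score = rouge_l_f1(candidate, reference)
print(round(score, 4))
```

Overlap scores of this kind reward shared wording in the same order, which is why they are typically paired with an embedding-based metric such as BERTScore and with human ratings of adequacy and fluency.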