Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of detecting complex multimodal disinformation, where existing approaches relying on global fusion often suffer from feature dilution and overlook local semantic inconsistencies. To this end, the authors propose the MaLSF framework, which innovatively leverages mask–label pairs as semantic anchors to enable fine-grained, bidirectional cross-modal verification between pixels and text. The framework integrates a Bidirectional Cross-modal Verification (BCV) module and a Hierarchical Semantic Aggregation (HSA) mechanism, facilitating active mutual validation and precise localization of image–text conflicts. It further incorporates mask-aware local fusion, a bidirectional query strategy, and multi-granularity conflict signal aggregation, supported by diverse parsers that extract fine-grained mask–label pairs. Evaluated on the DGM4 benchmark and multimodal fake news detection tasks, MaLSF achieves state-of-the-art performance, with ablation studies and visualizations confirming its effectiveness and interpretability.
📝 Abstract
As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.
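The bidirectional query mechanism in the abstract (Text-as-Query and Image-as-Query streams over mask–label anchors) can be sketched as two parallel cross-attention passes whose outputs are scored for agreement. This is an illustrative reconstruction, not the paper's implementation: the single-head attention, the cosine-based conflict score, and all function names here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    # Single-head scaled dot-product cross-attention:
    # queries (n, d) attend over keys_values (m, d) -> (n, d).
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values

def bidirectional_verification(text_tokens, mask_anchors):
    """Two parallel query streams, as in the BCV description:
    Text-as-Query interrogates the mask-label anchors, and
    Image-as-Query interrogates the text tokens. Each stream yields
    a per-query conflict score (here: 1 - cosine similarity between
    the query and what the other modality returns for it)."""
    def conflict(q, kv):
        out = cross_attend(q, kv)
        cos = (q * out).sum(-1) / (
            np.linalg.norm(q, axis=-1) * np.linalg.norm(out, axis=-1) + 1e-8
        )
        return 1.0 - cos  # ~0 when the modalities agree, larger on conflict

    t2i = conflict(text_tokens, mask_anchors)   # Text-as-Query stream
    i2t = conflict(mask_anchors, text_tokens)   # Image-as-Query stream
    return t2i, i2t
```

A hierarchical aggregation step (HSA in the paper) would then pool these multi-granularity conflict signals into a task-specific decision; a trivial stand-in is concatenating the two streams and taking their mean as a global inconsistency score.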
Problem

Research questions and friction points this paper is trying to address.

multimodal misinformation
feature dilution
local semantic inconsistency
multimodal verification
semantic conflict
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mask-aware Local Semantic Fusion
Bidirectional Cross-modal Verification
Hierarchical Semantic Aggregation
multimodal verification
mask-label pair extraction
Zizhao Chen
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Ping Wei
Fudan University
Multimedia security · Image synthesis
Ziyang Ren
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Huan Li
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Xiangru Yin
State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University