Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient cross-modal interaction and static fusion strategies in multi-modal offensive content detection, this paper proposes Co-AttenDWG, a dynamic dual-path framework. Methodologically, it designs a dual-path cross-modal encoder that refines heterogeneous text and image features along separate paths; introduces a fine-grained, dimension-wise gated co-attention mechanism for channel-adaptive feature selection and multi-granularity semantic alignment; and constructs a gated self-attention expert fusion module that replaces conventional static fusion. The key innovation is the integration of dimension-wise gating into co-attention, enabling dynamic, interpretable cross-modal alignment. Extensive experiments demonstrate state-of-the-art performance on the MIMIC and SemEval Memotion 1.0 benchmarks, with clear gains in detection accuracy and robustness.
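The dimension-wise gated co-attention described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function name, parameters `Wg`/`bg`, and the concatenation-based gating network are illustrative assumptions. It shows the core idea: text tokens attend over image regions, and a per-channel sigmoid gate then decides, dimension by dimension, how much attended image information to mix into each text token.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def co_attention_dim_gate(text, image, Wg, bg):
    """Cross-attend text over image, then gate each channel (illustrative).

    text:  (T, d) text token features
    image: (R, d) image region features
    Wg, bg: hypothetical parameters of the dimension-wise gating network
    """
    d = text.shape[1]
    # co-attention: text queries attend to image keys/values
    scores = text @ image.T / np.sqrt(d)          # (T, R)
    attended = softmax(scores, axis=-1) @ image   # (T, d)
    # dimension-wise gate: one sigmoid weight per channel, conditioned
    # on the concatenated text and attended-image features
    gate = sigmoid(np.concatenate([text, attended], axis=-1) @ Wg + bg)  # (T, d)
    # channel-adaptive mix of attended image evidence and original text
    return gate * attended + (1.0 - gate) * text
```

Because the gate is a vector rather than a scalar, each feature channel can independently emphasize or suppress cross-modal information, which is the channel-adaptive selection the summary refers to.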

📝 Abstract
Multi-modal learning has become a critical research area because integrating text and image data can significantly improve performance in tasks such as classification, retrieval, and scene understanding. However, despite progress with pre-trained models, current approaches are limited by inadequate cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities. To address these shortcomings, we introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. Our approach begins by projecting text and image features into a common embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This mechanism is further enhanced by a dimension-wise gating network that adaptively regulates feature contributions at the channel level, ensuring that only the most relevant information is emphasized. In parallel, dual-path encoders refine the representations by processing cross-modal information separately before an additional cross-attention layer further aligns the modalities. The refined features are then aggregated via an expert fusion module that combines learned gating and self-attention to produce a robust, unified representation. We validate our approach on the MIMIC and SemEval Memotion 1.0 datasets, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance, underscoring the potential of our model for a wide range of multi-modal applications.
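The expert fusion step in the abstract, which combines learned gating with self-attention, can be sketched as follows. This is a simplified stand-in, not the paper's module: the expert count, the mean-pooling step, and the parameter shapes are assumptions made for illustration. Self-attention first mixes the per-modality features, several expert projections then propose candidate fused vectors, and a learned softmax gate blends them into one unified representation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def expert_fusion(features, expert_ws, gate_w):
    """Fuse per-modality features via self-attention plus gated experts.

    features:  (M, d) one refined feature vector per modality
    expert_ws: list of (d, d) hypothetical expert projection matrices
    gate_w:    (d, n_experts) hypothetical gating-network weights
    """
    d = features.shape[1]
    # self-attention across modalities lets each feature attend to the others
    attn = softmax(features @ features.T / np.sqrt(d), axis=-1)  # (M, M)
    mixed = attn @ features                                      # (M, d)
    pooled = mixed.mean(axis=0)                                  # (d,)
    # each expert proposes its own fused representation
    experts = np.stack([pooled @ W for W in expert_ws])          # (n, d)
    # learned gate softly selects among the experts
    gate = softmax(pooled @ gate_w)                              # (n,)
    return gate @ experts                                        # (d,) unified vector
```

The contrast with static fusion (e.g. fixed concatenation or averaging) is that here the mixture weights are produced from the input itself, so the fusion adapts per example.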
Problem

Research questions and friction points this paper is trying to address.

Enhancing cross-modal interactions in multi-modal learning
Improving static fusion strategies for better modality integration
Detecting offensive content using advanced multi-modal fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-path encoding for cross-modal refinement
Dimension-wise gating for adaptive feature regulation
Expert fusion with learned gating and self-attention
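The dual-path encoding idea listed above can be sketched in a few lines. This is an illustrative simplification under assumed shapes and names, not the paper's encoder: each modality is first refined on its own path, and a subsequent cross-attention step with a residual connection aligns the refined text features to the refined image features.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_path_refine(text, image, Wt, Wi):
    """Refine each modality separately, then cross-attend to align them.

    text:  (T, d) text token features
    image: (R, d) image region features
    Wt, Wi: hypothetical per-path projection matrices, each (d, d)
    """
    t = relu(text @ Wt)    # text-only refinement path
    v = relu(image @ Wi)   # image-only refinement path
    d = t.shape[1]
    # cross-attention: each refined text token attends to image regions
    a = softmax(t @ v.T / np.sqrt(d), axis=-1)  # (T, R)
    return t + a @ v       # residual cross-modal alignment
```

Keeping the two paths separate before cross-attention preserves modality-specific structure that a single shared encoder would blur together.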
Md. Mithun Hossain
Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka 1216, Bangladesh
Md. Shakil Hossain
S. Chaki
Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka 1216, Bangladesh
M. F. Mridha
Department of Computer Science, American International University-Bangladesh, Dhaka 1229, Bangladesh