Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient cross-modal interaction and static fusion strategies in multi-modal offensive content detection, this paper proposes Co-AttenDWG, a dynamic dual-path framework. Methodologically, it designs a dual-path cross-modal encoder that refines heterogeneous text and image features along separate paths; introduces a fine-grained, dimension-wise gated co-attention mechanism for channel-adaptive feature selection and multi-granularity semantic alignment; and constructs a gated self-attention expert fusion module that replaces conventional static fusion. The key innovation is the integration of dimension-wise gating into co-attention, enabling dynamic, interpretable cross-modal alignment. Extensive experiments demonstrate state-of-the-art performance on the MIMIC and SemEval Memotion 1.0 benchmarks, with clear gains in detection accuracy and robustness.
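The dimension-wise gated co-attention described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function name, parameters `Wg`/`bg`, and the concatenation-based gating network are illustrative assumptions. It shows the core idea: text tokens attend over image regions, and a per-channel sigmoid gate then decides, dimension by dimension, how much attended image information to mix into each text token.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def co_attention_dim_gate(text, image, Wg, bg):
    """Cross-attend text over image, then gate each channel (illustrative).

    text:  (T, d) text token features
    image: (R, d) image region features
    Wg, bg: hypothetical parameters of the dimension-wise gating network
    """
    d = text.shape[1]
    # co-attention: text queries attend to image keys/values
    scores = text @ image.T / np.sqrt(d)          # (T, R)
    attended = softmax(scores, axis=-1) @ image   # (T, d)
    # dimension-wise gate: one sigmoid weight per channel, conditioned
    # on the concatenated text and attended-image features
    gate = sigmoid(np.concatenate([text, attended], axis=-1) @ Wg + bg)  # (T, d)
    # channel-adaptive mix of attended image evidence and original text
    return gate * attended + (1.0 - gate) * text
```

Because the gate is a vector rather than a scalar, each feature channel can independently emphasize or suppress cross-modal information, which is the channel-adaptive selection the summary refers to.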

📝 Abstract
Multi-modal learning has become a critical research area because integrating text and image data can significantly improve performance in tasks such as classification, retrieval, and scene understanding. However, despite progress with pre-trained models, current approaches are limited by inadequate cross-modal interactions and static fusion strategies that do not fully exploit the complementary nature of different modalities. To address these shortcomings, we introduce a novel multi-modal Co-AttenDWG architecture that leverages dual-path encoding, co-attention with dimension-wise gating, and advanced expert fusion. Our approach begins by projecting text and image features into a common embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This mechanism is further enhanced by a dimension-wise gating network that adaptively regulates feature contributions at the channel level, ensuring that only the most relevant information is emphasized. In parallel, dual-path encoders refine the representations by processing cross-modal information separately before an additional cross-attention layer further aligns the modalities. The refined features are then aggregated via an expert fusion module that combines learned gating and self-attention to produce a robust, unified representation. We validate our approach on the MIMIC and SemEval Memotion 1.0 datasets, where experimental results demonstrate significant improvements in cross-modal alignment and state-of-the-art performance, underscoring the potential of our model for a wide range of multi-modal applications.
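The expert fusion step in the abstract, which combines learned gating with self-attention, can be sketched as follows. This is a simplified stand-in, not the paper's module: the expert count, the mean-pooling step, and the parameter shapes are assumptions made for illustration. Self-attention first mixes the per-modality features, several expert projections then propose candidate fused vectors, and a learned softmax gate blends them into one unified representation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def expert_fusion(features, expert_ws, gate_w):
    """Fuse per-modality features via self-attention plus gated experts.

    features:  (M, d) one refined feature vector per modality
    expert_ws: list of (d, d) hypothetical expert projection matrices
    gate_w:    (d, n_experts) hypothetical gating-network weights
    """
    d = features.shape[1]
    # self-attention across modalities lets each feature attend to the others
    attn = softmax(features @ features.T / np.sqrt(d), axis=-1)  # (M, M)
    mixed = attn @ features                                      # (M, d)
    pooled = mixed.mean(axis=0)                                  # (d,)
    # each expert proposes its own fused representation
    experts = np.stack([pooled @ W for W in expert_ws])          # (n, d)
    # learned gate softly selects among the experts
    gate = softmax(pooled @ gate_w)                              # (n,)
    return gate @ experts                                        # (d,) unified vector
```

The contrast with static fusion (e.g. fixed concatenation or averaging) is that here the mixture weights are produced from the input itself, so the fusion adapts per example.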
Problem

Research questions and friction points this paper is trying to address.

Enhancing cross-modal interactions in multi-modal learning
Improving static fusion strategies for better modality integration
Detecting offensive content using advanced multi-modal fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-path encoding for cross-modal refinement
Dimension-wise gating for adaptive feature regulation
Expert fusion with learned gating and self-attention
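The dual-path encoding idea listed above can be sketched in a few lines. This is an illustrative simplification under assumed shapes and names, not the paper's encoder: each modality is first refined on its own path, and a subsequent cross-attention step with a residual connection aligns the refined text features to the refined image features.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_path_refine(text, image, Wt, Wi):
    """Refine each modality separately, then cross-attend to align them.

    text:  (T, d) text token features
    image: (R, d) image region features
    Wt, Wi: hypothetical per-path projection matrices, each (d, d)
    """
    t = relu(text @ Wt)    # text-only refinement path
    v = relu(image @ Wi)   # image-only refinement path
    d = t.shape[1]
    # cross-attention: each refined text token attends to image regions
    a = softmax(t @ v.T / np.sqrt(d), axis=-1)  # (T, R)
    return t + a @ v       # residual cross-modal alignment
```

Keeping the two paths separate before cross-attention preserves modality-specific structure that a single shared encoder would blur together.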
Md. Mithun Hossain
Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka 1216, Bangladesh
Md. Shakil Hossain
S. Chaki
Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka 1216, Bangladesh
M. F. Mridha
Department of Computer Science, American International University-Bangladesh, Dhaka 1229, Bangladesh