CAGMamba: Context-Aware Gated Cross-Modal Mamba Network for Multimodal Sentiment Analysis

📅 2026-04-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal sentiment analysis faces two significant challenges: the high computational complexity of cross-modal interactions and the insufficient modeling of dynamic emotional shifts in conversations. This work proposes CAGMamba, a novel framework that introduces the Mamba state space model to this domain for the first time. By organizing the context and the current utterance into a temporally ordered binary sequence, it explicitly captures sentiment evolution over time. Furthermore, a gated cross-modal Mamba network enables efficient and controllable modality fusion, enhancing inter-modal information exchange while preserving modality-specific characteristics. Combined with a three-branch multi-task learning strategy, the model achieves state-of-the-art or highly competitive performance across three benchmark datasets, advancing the effectiveness of multimodal sentiment analysis.
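As a rough illustration of the temporal binary sequence described above, the sketch below stacks a context feature and a current-utterance feature into a two-step sequence for a left-to-right sequence model. All tensor names and shapes are hypothetical, and a GRU stands in for the paper's Mamba block; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import torch

# Hypothetical pooled feature vectors, one per utterance (names/shapes are illustrative).
batch, d_model = 8, 128
context_feat = torch.randn(batch, d_model)   # preceding-utterance (context) representation
current_feat = torch.randn(batch, d_model)   # current-utterance representation

# Temporally ordered binary sequence: [context, current], so a causal sequence
# model sees "past before present" and can model sentiment evolution across turns.
binary_seq = torch.stack([context_feat, current_feat], dim=1)  # (batch, 2, d_model)

# The paper feeds this sequence to a Mamba state space block; a GRU is used here
# only as a generic causal sequence mixer for illustration.
mixer = torch.nn.GRU(d_model, d_model, batch_first=True)
mixed, _ = mixer(binary_seq)
current_context_aware = mixed[:, -1]          # context-aware current-utterance feature
```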
📝 Abstract
Multimodal Sentiment Analysis (MSA) requires effective modeling of cross-modal interactions and contextual dependencies while remaining computationally efficient. Existing fusion approaches predominantly rely on Transformer-based cross-modal attention, which incurs quadratic complexity with respect to sequence length and limits scalability. Moreover, contextual information from preceding utterances is often incorporated through concatenation or independent fusion, without explicit temporal modeling that captures sentiment evolution across dialogue turns. To address these limitations, we propose CAGMamba, a context-aware gated cross-modal Mamba framework for dialogue-based sentiment analysis. Specifically, we organize the contextual and the current-utterance features into a temporally ordered binary sequence, which provides Mamba with explicit temporal structure for modeling sentiment evolution. To further enable controllable cross-modal integration, we propose a Gated Cross-Modal Mamba Network (GCMN) that integrates cross-modal and unimodal paths via learnable gating to balance information fusion and modality preservation, and is trained with a three-branch multi-task objective over text, audio, and fused predictions. Experiments on three benchmark datasets demonstrate that CAGMamba achieves state-of-the-art or competitive results across multiple evaluation metrics. All code is available at https://github.com/User2024-xj/CAGMamba.
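The gated integration and three-branch objective described in the abstract can be sketched as follows, under assumed shapes and loss weights. `GatedCrossModalFusion`, `three_branch_loss`, and the linear cross-modal path are illustrative stand-ins (the real cross-modal path is a Mamba block), not the released CAGMamba code; see the repository linked above for the actual implementation.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Illustrative sketch (not the authors' code): a learnable gate blends a
    cross-modal path with a unimodal path, trading off information fusion
    against preservation of modality-specific features."""

    def __init__(self, d_model: int):
        super().__init__()
        # Stand-in for the cross-modal Mamba path: any module that conditions
        # the target modality on the source modality could sit here.
        self.cross_path = nn.Linear(2 * d_model, d_model)
        self.uni_path = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([target, source], dim=-1)
        cross = torch.tanh(self.cross_path(pair))      # cross-modal path
        uni = torch.tanh(self.uni_path(target))        # unimodal (preservation) path
        g = torch.sigmoid(self.gate(pair))             # learnable gate in [0, 1]
        return g * cross + (1.0 - g) * uni


def three_branch_loss(pred_text, pred_audio, pred_fused, label, w=(0.5, 0.5, 1.0)):
    """Three-branch multi-task objective: separate sentiment heads on the text,
    audio, and fused representations, each supervised by the same label.
    The weights w are hypothetical; the paper's exact weighting is not given here."""
    mse = nn.functional.mse_loss
    return w[0] * mse(pred_text, label) + w[1] * mse(pred_audio, label) + w[2] * mse(pred_fused, label)
```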
Problem

Research questions and friction points this paper is trying to address.

Multimodal Sentiment Analysis
Cross-Modal Interaction
Contextual Dependency
Temporal Modeling
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba
cross-modal fusion
context-aware modeling
gated integration
multimodal sentiment analysis
Minghai Jiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Jing Xiao
Beijing Key Laboratory of Learning and Cognition, School of Psychology, Capital Normal University
Cognitive vulnerability to depression · School psychology · Cognition and learning
Peng Xiao
Xingning Power Supply Bureau, Guangdong Power Grid Co., Ltd., Meizhou, China
Ende Zhang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Shuang Kan
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Wenyan Jiang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Jinyao Li
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yixian Liu
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Haidong Xin
Northeastern University; Harbin Engineering University
NLP · Information Retrieval · RAG · Recsys · Multi-agent