Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of heterogeneity neglect and information asymmetry in multimodal sentiment analysis caused by conflated spatiotemporal modeling. It proposes a novel approach that explicitly decouples each modality into temporal dynamics and spatial structure representations. The framework employs a dual time–space encoder, factor-consistent cross-modal alignment, factor-specific supervision, and decorrelation regularization, followed by a gated re-coupling module for effective fusion. By introducing factor-level alignment and a leakage-prevention mechanism, the method enhances model interpretability while achieving state-of-the-art performance across multiple benchmark datasets. Ablation studies further confirm the effectiveness and necessity of each component in the proposed architecture.

📝 Abstract
Multimodal Sentiment Analysis integrates linguistic, visual, and acoustic modalities. Mainstream approaches based on modality-invariant and modality-specific factorization, or on complex fusion, still rely on spatiotemporally mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial subspaces. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for the downstream task. Extensive experiments show that TSDA outperforms baselines. Ablation studies confirm the necessity and interpretability of the design.
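The decouple-then-recouple pipeline the abstract describes can be sketched minimally. The snippet below is an illustrative numpy sketch, not the paper's implementation: the projection matrices `Wt`, `Ws`, `Wg` and the pooling choices are hypothetical stand-ins for the paper's learned temporal/spatial encoders and gated re-coupling module, and the decorrelation term is one common way to discourage cross-factor leakage.

```python
import numpy as np

rng = np.random.default_rng(0)

def decouple(X, Wt, Ws):
    """Split one modality's sequence X (T x d) into a temporal and a spatial factor.

    Temporal factor: pooled first-order differences (frame-to-frame dynamics).
    Spatial factor: pooled frame content (structural context).
    Wt, Ws are hypothetical per-factor projections (d x k), standing in for
    the paper's learned temporal and spatial encoders.
    """
    dyn = np.diff(X, axis=0)               # (T-1, d): temporal dynamics
    z_t = np.tanh(dyn.mean(axis=0) @ Wt)   # temporal factor, shape (k,)
    z_s = np.tanh(X.mean(axis=0) @ Ws)     # spatial factor, shape (k,)
    return z_t, z_s

def decorrelation_penalty(z_t, z_s):
    """Squared inner product between factors; driving it toward zero
    discourages cross-factor leakage (one simple decorrelation regularizer)."""
    return float((z_t @ z_s) ** 2)

def gated_recouple(z_t, z_s, Wg):
    """Gated re-coupling: a sigmoid gate decides, per dimension, how much
    of the temporal vs. spatial factor enters the fused representation."""
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([z_t, z_s]) @ Wg)))
    return g * z_t + (1.0 - g) * z_s

d, k = 16, 8
X = rng.normal(size=(20, d))               # one modality: 20 timesteps, d features
Wt, Ws = rng.normal(size=(d, k)), rng.normal(size=(d, k))
Wg = rng.normal(size=(2 * k, k))

z_t, z_s = decouple(X, Wt, Ws)
fused = gated_recouple(z_t, z_s, Wg)
print(fused.shape)
```

In the full method, the alignment step would pull `z_t` factors from different modalities together (and likewise `z_s` factors) before re-coupling; here only the single-modality decouple/recouple path is shown.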
Problem

Research questions and friction points this paper is trying to address.

Multimodal Sentiment Analysis
Spatiotemporal Heterogeneity
Information Asymmetry
Disentangled Representation
Modality Fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal-Spatial Decoupling
Disentangled Representation Learning
Multimodal Sentiment Analysis
Cross-Modal Alignment
Factor-Consistent Supervision
Chunlei Meng
Fudan University
Embodied AI, Multimodal, Multi-agent
Ziyang Zhou
Shantou University
Lucas He
University College London
Xiaojing Du
University of South Australia
Chun Ouyang
Associate Professor, PhD, Queensland University of Technology
Process Mining, Explainable AI, Predictive Analytics, AI Robustness, Machine Learning
Zhongxue Gan
Fudan University