🤖 AI Summary
Existing multimodal sentiment analysis methods struggle to dynamically model inter-modal interactions and are often biased by sentiment cues inherent in the language modality. To address these limitations, this work proposes the Dynamic Multimodal Causal Disentanglement and Adaptive Fusion Framework (MCAF), which integrates causal intervention with multi-granularity dynamic routing to achieve fine-grained interaction awareness and bias disentanglement at the feature, temporal, and modality levels. Built upon the information bottleneck principle, MCAF formulates a structural causal model and introduces a multi-granularity causal dynamic router that adaptively identifies complementary, conflicting, or redundant relationships among modalities. A conditional diffusion denoising module is further incorporated to enhance robustness. The framework achieves state-of-the-art performance on CMU-MOSI and CMU-MOSEI, yielding Acc-2/F1 scores of 86.52%/86.51% and 86.72%/86.65%, respectively.
📝 Abstract
Although Multimodal Sentiment Analysis (MSA) effectively leverages rich information from the language, visual, and acoustic modalities, existing methods still face two core challenges: 1) static conflict suppression mechanisms fail to adapt to dynamic variations across samples, and 2) the sentiment bias inherent in the language modality, which can misguide learning from the other modalities, remains entangled with the language features. To this end, we propose a Dynamic Multimodal Causal Disentanglement and Adaptive Fusion Framework (MCAF). Its cornerstones are a Multi-Granularity Causal Dynamic Router and a Conditional Diffusion Denoising Module. First, we introduce a causal intervention module based on the information bottleneck principle, which builds a Structural Causal Model to disentangle sentiment bias from language features, yielding a "de-confounded" language representation that serves as a pure guiding signal. Second, we devise a Dynamic Multimodal Router that evaluates the interaction states (complementary, conflicting, or redundant) among the visual, acoustic, and de-confounded language signals in real time at three levels (feature, temporal, and modality), and then adaptively allocates weights and routes information flow for fine-grained regulation. Finally, a lightweight Conditional Diffusion Denoising Module performs iterative denoising on the fused joint representation to explicitly filter out residual irrelevant information, generating a robust hyper-modality representation. Extensive experiments on the CMU-MOSI and CMU-MOSEI benchmarks show that MCAF sets a new state of the art on key classification metrics, achieving Acc-2/F1 of 86.52%/86.51% on MOSI and 86.72%/86.65% on MOSEI, while remaining highly competitive on the other metrics. Comprehensive analyses and visualizations further validate its efficacy in dynamically perceiving interactions, disentangling bias, and enhancing interpretability.
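To make the routing idea concrete, the following is a minimal sketch of interaction-aware weighting at the modality level only. It is not the authors' implementation: the abstract does not specify how interaction states are scored, so the cosine-similarity criterion, the thresholds `redundant_thr` and `conflict_thr`, the per-state gate scores, and the function names `interaction_state` and `route` are all assumptions made for illustration. The feature and temporal levels, the causal intervention, and the diffusion denoising step are omitted.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def interaction_state(sim, redundant_thr=0.8, conflict_thr=-0.2):
    """Map a similarity score to an interaction state.

    Thresholds are hypothetical; the paper's actual criterion is not
    given in the abstract.
    """
    if sim >= redundant_thr:
        return "redundant"
    if sim <= conflict_thr:
        return "conflicting"
    return "complementary"

# Hypothetical gate scores: favor complementary signals, damp redundant
# ones, and suppress conflicting ones most strongly.
GATE_SCORE = {"complementary": 1.0, "redundant": 0.0, "conflicting": -1.0}

def route(language, visual, acoustic):
    """Weight visual/acoustic features by their relation to the language guide.

    `language` stands in for the de-confounded language representation.
    Returns (states, weights, fused).
    """
    aux = {"visual": visual, "acoustic": acoustic}
    states = {name: interaction_state(cosine(language, feat))
              for name, feat in aux.items()}
    scores = np.array([GATE_SCORE[states[name]] for name in aux])
    # Softmax over the gate scores yields per-sample routing weights.
    exp = np.exp(scores - scores.max())
    weights = exp / exp.sum()
    fused = language + sum(w * feat for w, feat in zip(weights, aux.values()))
    return states, dict(zip(aux, weights)), fused
```

Because the weights are recomputed from each sample's own features, the suppression is per-sample rather than static, which is the property the abstract emphasizes. A learned router would replace the fixed thresholds and gate scores with trainable parameters.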