🤖 AI Summary
Existing multimodal sentiment analysis methods suffer from two key limitations: sensitivity to unimodal noise and redundancy in cross-modal fusion. To address these, we propose a dual information bottleneck framework. First, a unimodal information bottleneck based on low-rank Rényi entropy compresses task-irrelevant noise while preserving discriminative features. Second, an attention-driven cross-modal information bottleneck dynamically selects complementary multimodal interactions and suppresses redundant fusion. The framework remains computationally tractable while significantly enhancing representation robustness and discriminability. Our method achieves state-of-the-art performance on CMU-MOSI (Acc-7: 47.4%) and CH-SIMS (F1: 81.63%, a 1.19% improvement over the second-best baseline). Under strong artificial noise, performance degrades by only 0.29–0.36%, demonstrating superior noise resilience and generalization.
📝 Abstract
Multimodal sentiment analysis has received significant attention across diverse research domains. Despite advances in algorithm design, existing approaches suffer from two critical limitations: 1) insufficient learning from noise-contaminated unimodal data, which corrupts cross-modal interactions, and 2) inadequate fusion of multimodal representations, which discards discriminative unimodal information while retaining redundant multimodal information. To address these challenges, this paper proposes a Double Information Bottleneck (DIB) strategy to obtain a powerful, unified, and compact multimodal representation. Implemented within the framework of the low-rank Rényi entropy functional, DIB offers enhanced robustness against diverse noise sources and computational tractability for high-dimensional data compared with conventional Shannon entropy-based methods. DIB comprises two key modules: 1) learning a sufficient and compressed representation of each unimodal input by maximizing task-relevant information and discarding superfluous information, and 2) ensuring the discriminative ability of the multimodal representation through a novel attention bottleneck fusion mechanism. Consequently, DIB yields a multimodal representation that filters out noisy information from unimodal data while capturing inter-modal complementarity. Extensive experiments on CMU-MOSI, CMU-MOSEI, CH-SIMS, and MVSA-Single validate the effectiveness of our method. The model achieves 47.4% accuracy under the Acc-7 metric on CMU-MOSI and an 81.63% F1-score on CH-SIMS, outperforming the second-best baseline by 1.19%. Under noise, it shows only 0.36% and 0.29% performance degradation on CMU-MOSI and CMU-MOSEI, respectively.
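Since the method centers on a low-rank Rényi entropy functional, a minimal sketch of how such quantities are typically estimated may help orient readers: the matrix-based Rényi α-entropy is computed from the eigenvalue spectrum of a trace-normalized kernel Gram matrix, and truncating that spectrum to its top-k eigenvalues (a low-rank approximation) keeps the estimator tractable in high dimensions. The function names, Gaussian kernel choice, and truncation scheme below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def gram_matrix(x, sigma=1.0):
    # Trace-normalized Gram matrix from a Gaussian kernel; trace(A) == 1.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=2.0, rank=None):
    # Matrix-based Renyi alpha-entropy: S_alpha = 1/(1-alpha) * log2(sum_i lambda_i^alpha),
    # where lambda_i are eigenvalues of the normalized Gram matrix `a`.
    eigvals = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    if rank is not None:
        # Low-rank truncation: keep the top-`rank` eigenvalues, then renormalize
        # so the truncated spectrum still sums to one.
        eigvals = np.sort(eigvals)[::-1][:rank]
        eigvals = eigvals / eigvals.sum()
    return (1.0 / (1.0 - alpha)) * np.log2(np.sum(eigvals ** alpha))

# Toy usage: entropy of 64 random 16-dimensional feature vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
h = renyi_entropy(gram_matrix(x), alpha=2.0, rank=32)
```

An information bottleneck objective would then trade off terms of this form, e.g. maximizing entropy shared with the labels while penalizing the entropy retained from the noisy input, with the low-rank spectrum bounding the cost of each evaluation.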