🤖 AI Summary
To address the challenge of jointly modeling modality-shared and modality-specific information in multimodal sentiment analysis, this paper proposes the Hierarchical Adaptive Expert for Multimodal Sentiment Analysis (HAEMSA). HAEMSA employs a hierarchical structure of adaptive experts with modality-adaptive gating to dynamically fuse textual, acoustic, and visual modalities, supporting robust recognition under both partial- and full-modality input settings. It introduces a joint optimization strategy that uses evolutionary algorithms to search network architectures and modality combinations simultaneously, thereby unifying cross-modal knowledge transfer and multi-task learning. Extensive experiments on the CMU-MOSEI, CMU-MOSI, and IEMOCAP benchmarks show consistent improvements over state-of-the-art methods: absolute gains of 2.6% (CMU-MOSEI) and 6.3% (CMU-MOSI) in 7-class accuracy with MAE reductions of 0.059 and 0.058, respectively, and a 2.84% gain in weighted F1-score on IEMOCAP.
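The summary's "modality-adaptive gating under partial- and full-modality inputs" can be illustrated with a minimal sketch: mask out absent modalities and renormalize a softmax gate over the ones that remain. This is an assumption-laden toy (the function names, the list-of-floats features, and the softmax gate are illustrative, not the paper's exact mechanism):

```python
import math

def _softmax(logits):
    # numerically stable softmax over a list of gate logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def gated_fusion(modalities, gate_logits):
    """Adaptive gated fusion over whichever modalities are present.

    `modalities` maps modality name -> feature vector (list of floats) or
    None; absent modalities are masked out and the gate renormalized, which
    is one simple way to handle partial-modality inputs. Hypothetical
    sketch, not HAEMSA's actual fusion layer.
    """
    present = [(name, vec) for name, vec in modalities.items() if vec is not None]
    weights = _softmax([gate_logits[name] for name, _ in present])
    dim = len(present[0][1])
    # weighted sum of the available modality features
    fused = [sum(w * vec[i] for w, (_, vec) in zip(weights, present))
             for i in range(dim)]
    return fused, dict(zip((name for name, _ in present), weights))
```

With the visual stream missing, the gate renormalizes over text and audio only, so the fused vector is still well defined.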
📝 Abstract
Multimodal sentiment analysis has emerged as a critical tool for understanding human emotions across diverse communication channels. While existing methods have made significant strides, they often struggle to effectively differentiate and integrate modality-shared and modality-specific information, limiting the performance of multimodal learning. To address this challenge, we propose the Hierarchical Adaptive Expert for Multimodal Sentiment Analysis (HAEMSA), a novel framework that synergistically combines evolutionary optimization, cross-modal knowledge transfer, and multi-task learning. HAEMSA employs a hierarchical structure of adaptive experts to capture both global and local modality representations, enabling more nuanced sentiment analysis. Our approach leverages evolutionary algorithms to dynamically optimize network architectures and modality combinations, adapting to both partial and full modality scenarios. Extensive experiments demonstrate HAEMSA's superior performance across multiple benchmark datasets. On CMU-MOSEI, HAEMSA achieves a 2.6% increase in 7-class accuracy and a 0.059 decrease in MAE compared to the previous best method. For CMU-MOSI, we observe a 6.3% improvement in 7-class accuracy and a 0.058 reduction in MAE. On IEMOCAP, HAEMSA outperforms the state-of-the-art by 2.84% in weighted-F1 score for emotion recognition. These results underscore HAEMSA's effectiveness in capturing complex multimodal interactions and generalizing across different emotional contexts.
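The abstract's evolutionary search over modality combinations can be sketched as a toy genetic loop: individuals are modality subsets, selection keeps the fittest half, and children are produced by crossover plus occasional mutation. Everything here is a hypothetical illustration (the population size, operators, and the caller-supplied `fitness` stand in for the paper's validation-driven search, which also covers network architectures):

```python
import random

def evolve_modality_subsets(fitness, generations=20, pop_size=8, seed=0):
    """Toy evolutionary search over modality combinations (sketch only).

    Each individual is a frozenset of modality names; `fitness` scores a
    subset (in the paper this role is played by validation performance).
    """
    rng = random.Random(seed)
    modalities = ["text", "audio", "visual"]

    def random_subset():
        k = rng.randint(1, len(modalities))
        return frozenset(rng.sample(modalities, k))

    population = [random_subset() for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep the fittest half as survivors
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = rng.sample(survivors, 2)
            # crossover: keep the intersection, coin-flip the rest
            child = set(a & b) | {m for m in (a | b) if rng.random() < 0.5}
            if rng.random() < 0.2:
                child ^= {rng.choice(modalities)}  # mutation: toggle one modality
            children.append(frozenset(child) or frozenset([rng.choice(modalities)]))
        population = survivors + children
    return max(population, key=fitness)
```

Because survivors carry over unchanged, a good subset is never lost once found; the same elitist loop extends naturally to jointly encoding architecture choices alongside the modality mask.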