DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal sentiment analysis (MSA) faces dual challenges: difficulty in achieving cross-modal temporal and semantic alignment, and low fusion efficiency. To address these, we propose a dual-stream alignment and hierarchical bottleneck fusion framework. First, frame-level temporal alignment is achieved via cross-modal attention, while supervised contrastive learning enforces semantic consistency across modalities. Second, a multi-stage fusion strategy is introduced, leveraging compressed bottleneck tokens to preserve discriminative information while substantially reducing computational overhead. Our method achieves state-of-the-art performance on CMU-MOSI, CMU-MOSEI, and CH-SIMS. Ablation studies quantitatively validate the effectiveness of both the dual-stream alignment mechanism and the hierarchical bottleneck design, confirming their complementary synergy and individual contributions to overall performance.
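The summary mentions supervised contrastive learning that uses label information to enforce semantic consistency across modalities. As a rough, paper-independent sketch (the function name, temperature `tau`, and all shapes are illustrative, not taken from DashFusion's implementation), a supervised contrastive loss over sentiment-labeled embeddings can look like:

```python
import numpy as np

def supcon_loss(features, labels, tau=0.1):
    """Supervised contrastive loss sketch: embeddings sharing a sentiment label
    are pulled together, embeddings with different labels pushed apart.
    features: (N, d) embeddings, labels: (N,) integer class labels."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / tau                  # pairwise cosine similarities / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs from the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    # Average log-probability over each anchor's positives, then over anchors.
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return -per_anchor.mean()

rng = np.random.default_rng(2)
feats = rng.standard_normal((8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1])
loss = supcon_loss(feats, labels)
print(float(loss))
```

Minimizing this loss increases the similarity of same-label pairs relative to all other pairs, which is the sense in which it "refines the modality features" with label information.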

📝 Abstract
Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). Firstly, the dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Secondly, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The code for our experiments is available at https://github.com/ultramarineX/DashFusion.
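The temporal alignment described in the abstract establishes frame-level correspondences via cross-modal attention. A minimal NumPy sketch of that idea, with illustrative names and shapes rather than the paper's actual layers: text frames act as queries and attend over an audio sequence, producing audio features re-timed to the text frames.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, audio):
    """Align an audio sequence to the text timeline.
    text: (T_t, d) queries, audio: (T_a, d) keys/values.
    Returns audio features resampled to the T_t text frames."""
    d_k = text.shape[-1]
    scores = text @ audio.T / np.sqrt(d_k)   # (T_t, T_a) frame-level affinities
    weights = softmax(scores, axis=-1)       # each text frame attends over audio frames
    return weights @ audio                   # (T_t, d) audio aligned to text

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 16))
audio = rng.standard_normal((9, 16))
aligned = cross_modal_attention(text, audio)
print(aligned.shape)  # (5, 16)
```

The same pattern applies symmetrically to any modality pair (e.g. text queries over video frames), which is how frame-level correspondences among all multimodal sequences can be established.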
Problem

Research questions and friction points this paper is trying to address.

Addresses multimodal alignment and fusion challenges in sentiment analysis
Synchronizes temporal and semantic information across different modalities
Balances performance and computational efficiency in multimodal integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream alignment synchronizes temporal and semantic multimodal features
Hierarchical bottleneck fusion integrates information through compressed tokens
Supervised contrastive learning refines features using label information
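The bottleneck-token idea behind hierarchical fusion can be sketched as follows, assuming single-head dot-product attention and illustrative shapes (this is not the paper's actual architecture): a few compressed tokens first summarize all modality sequences, and each modality then reads only that small summary, so full pairwise cross-sequence attention is avoided and cost stays linear in sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Single-head dot-product attention: queries q read from sequence kv."""
    w = softmax(q @ kv.T / np.sqrt(q.shape[-1]), axis=-1)
    return w @ kv

def bottleneck_fusion_step(modalities, bottleneck):
    """One fusion stage: bottleneck tokens gather information from every
    modality, then each modality reads the compressed summary back."""
    # Bottleneck tokens attend over the concatenation of all modality sequences.
    pooled = attend(bottleneck, np.concatenate(modalities, axis=0))
    # Each modality attends only to the small updated bottleneck, not to the
    # other full-length sequences.
    updated = [attend(m, pooled) + m for m in modalities]  # residual update
    return updated, pooled

rng = np.random.default_rng(1)
text, audio, video = (rng.standard_normal((n, 16)) for n in (12, 30, 20))
bottleneck = rng.standard_normal((4, 16))  # B=4 compressed tokens
mods, b = bottleneck_fusion_step([text, audio, video], bottleneck)
print([m.shape for m in mods], b.shape)  # [(12, 16), (30, 16), (20, 16)] (4, 16)
```

Stacking several such stages, each starting from the previous stage's bottleneck, gives the "hierarchical" progressive integration the paper describes; with B bottleneck tokens the per-stage cost is O((T_t + T_a + T_v) · B) rather than quadratic in total sequence length.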
Yuhua Wen
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Qifei Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Yingying Zhou
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Yingming Gao
Beijing University of Posts and Telecommunications
Computer Assisted Language Learning; Acoustic Phonetics and Speech Synthesis
Zhengqi Wen
Tsinghua University
LLM
Jianhua Tao
Department of Automation, Tsinghua University, Beijing 100084, China, and also with the Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
Ya Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China