DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal sentiment analysis (MSA) faces dual challenges: difficulty in achieving cross-modal temporal and semantic alignment, and low fusion efficiency. To address these, we propose a dual-stream alignment and hierarchical bottleneck fusion framework. First, frame-level temporal alignment is achieved via cross-modal attention, while supervised contrastive learning enforces semantic consistency across modalities. Second, a multi-stage fusion strategy is introduced, leveraging compressed bottleneck tokens to preserve discriminative information while substantially reducing computational overhead. Our method achieves state-of-the-art performance on CMU-MOSI, CMU-MOSEI, and CH-SIMS. Ablation studies quantitatively validate the effectiveness of both the dual-stream alignment mechanism and the hierarchical bottleneck design, confirming their complementary synergy and individual contributions to overall performance.
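The summary mentions supervised contrastive learning that uses label information to enforce semantic consistency across modalities. As a rough, paper-independent sketch (the function name, temperature `tau`, and all shapes are illustrative, not taken from DashFusion's implementation), a supervised contrastive loss over sentiment-labeled embeddings can look like:

```python
import numpy as np

def supcon_loss(features, labels, tau=0.1):
    """Supervised contrastive loss sketch: embeddings sharing a sentiment label
    are pulled together, embeddings with different labels pushed apart.
    features: (N, d) embeddings, labels: (N,) integer class labels."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / tau                  # pairwise cosine similarities / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs from the softmax
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    # Average log-probability over each anchor's positives, then over anchors.
    per_anchor = np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return -per_anchor.mean()

rng = np.random.default_rng(2)
feats = rng.standard_normal((8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 0, 1])
loss = supcon_loss(feats, labels)
print(float(loss))
```

Minimizing this loss increases the similarity of same-label pairs relative to all other pairs, which is the sense in which it "refines the modality features" with label information.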

📝 Abstract
Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). Firstly, the dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Secondly, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The code for our experiments is available at https://github.com/ultramarineX/DashFusion.
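The temporal alignment described in the abstract establishes frame-level correspondences via cross-modal attention. A minimal NumPy sketch of that idea, with illustrative names and shapes rather than the paper's actual layers: text frames act as queries and attend over an audio sequence, producing audio features re-timed to the text frames.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, audio):
    """Align an audio sequence to the text timeline.
    text: (T_t, d) queries, audio: (T_a, d) keys/values.
    Returns audio features resampled to the T_t text frames."""
    d_k = text.shape[-1]
    scores = text @ audio.T / np.sqrt(d_k)   # (T_t, T_a) frame-level affinities
    weights = softmax(scores, axis=-1)       # each text frame attends over audio frames
    return weights @ audio                   # (T_t, d) audio aligned to text

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 16))
audio = rng.standard_normal((9, 16))
aligned = cross_modal_attention(text, audio)
print(aligned.shape)  # (5, 16)
```

The same pattern applies symmetrically to any modality pair (e.g. text queries over video frames), which is how frame-level correspondences among all multimodal sequences can be established.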
Problem

Research questions and friction points this paper is trying to address.

Addresses multimodal alignment and fusion challenges in sentiment analysis
Synchronizes temporal and semantic information across different modalities
Balances performance and computational efficiency in multimodal integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream alignment synchronizes temporal and semantic multimodal features
Hierarchical bottleneck fusion integrates information through compressed tokens
Supervised contrastive learning refines features using label information
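The bottleneck-token idea behind hierarchical fusion can be sketched as follows, assuming single-head dot-product attention and illustrative shapes (this is not the paper's actual architecture): a few compressed tokens first summarize all modality sequences, and each modality then reads only that small summary, so full pairwise cross-sequence attention is avoided and cost stays linear in sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Single-head dot-product attention: queries q read from sequence kv."""
    w = softmax(q @ kv.T / np.sqrt(q.shape[-1]), axis=-1)
    return w @ kv

def bottleneck_fusion_step(modalities, bottleneck):
    """One fusion stage: bottleneck tokens gather information from every
    modality, then each modality reads the compressed summary back."""
    # Bottleneck tokens attend over the concatenation of all modality sequences.
    pooled = attend(bottleneck, np.concatenate(modalities, axis=0))
    # Each modality attends only to the small updated bottleneck, not to the
    # other full-length sequences.
    updated = [attend(m, pooled) + m for m in modalities]  # residual update
    return updated, pooled

rng = np.random.default_rng(1)
text, audio, video = (rng.standard_normal((n, 16)) for n in (12, 30, 20))
bottleneck = rng.standard_normal((4, 16))  # B=4 compressed tokens
mods, b = bottleneck_fusion_step([text, audio, video], bottleneck)
print([m.shape for m in mods], b.shape)  # [(12, 16), (30, 16), (20, 16)] (4, 16)
```

Stacking several such stages, each starting from the previous stage's bottleneck, gives the "hierarchical" progressive integration the paper describes; with B bottleneck tokens the per-stage cost is O((T_t + T_a + T_v) · B) rather than quadratic in total sequence length.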
Yuhua Wen
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Qifei Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Yingying Zhou
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Yingming Gao
Beijing University of Posts and Telecommunications
Computer Assisted Language Learning; Acoustic Phonetics and Speech Synthesis
Zhengqi Wen
Tsinghua University
LLM
Jianhua Tao
Department of Automation, Tsinghua University, Beijing 100084, China, and also with the Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
Ya Li
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China