HeLo: Heterogeneous Multi-Modal Fusion with Label Correlation for Emotion Distribution Learning

📅 2025-07-09
🤖 AI Summary
To address insufficient modeling of modality heterogeneity and under-exploited semantic correlation among emotion labels in multi-modal emotion distribution learning, this paper proposes HeLo. HeLo introduces an optimal transport (OT)-based heterogeneity mining module that captures the interaction and heterogeneity between physiological and behavioral representations. It also learns label embeddings, optimized by correlation matrix alignment, and combines them with the multi-modal representations through label correlation-driven cross-attention, so that cross-modal features and label relationships are modeled together. Experiments on two publicly available datasets show that HeLo improves over state-of-the-art methods in emotion distribution prediction. Ablation studies confirm the effectiveness of both the heterogeneity-aware representation learning and the label-semantic modeling components. Overall, HeLo treats modality heterogeneity and label semantics jointly rather than in isolation.
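
To make the optimal transport idea concrete, the following is a minimal sketch of entropy-regularized OT solved with Sinkhorn iterations, using only the Python standard library. It is not HeLo's module: the summary does not give the paper's cost function, regularization, or solver, so the squared Euclidean cost, the uniform marginals, the value of `eps`, and the toy `physio` and `behav` vectors are all assumptions made for illustration.

```python
import math

def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=200):
    """Entropy-regularized OT plan between histograms a and b (generic Sinkhorn)."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheap pairs get large entries, expensive pairs small ones.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def sq_dist(x, y):
    """Squared Euclidean distance (an assumed cost, not taken from the paper)."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

# Hypothetical toy features: three physiological and two behavioral tokens.
physio = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
behav = [[0.1, 0.0], [0.9, 0.2]]

cost = [[sq_dist(p, q) for q in behav] for p in physio]
a = [1.0 / 3] * 3   # uniform mass over physiological tokens
b = [0.5, 0.5]      # uniform mass over behavioral tokens

plan = sinkhorn_plan(cost, a, b)
ot_cost = sum(plan[i][j] * cost[i][j] for i in range(3) for j in range(2))
print("transport plan:", plan)
print("transport cost:", ot_cost)
```

The plan shows which tokens of one modality are matched to which tokens of the other, and the total transport cost gives one scalar measure of how far apart the two sets of features are.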

📝 Abstract
Multi-modal emotion recognition has garnered increasing attention in recent years because it plays a significant role in human-computer interaction (HCI). Since different discrete emotions may exist at the same time, emotion distribution learning (EDL), which identifies a mixture of basic emotions, has gradually emerged as a trend in contrast to single-class emotion recognition. However, existing EDL methods face challenges in mining the heterogeneity among multiple modalities. In addition, the rich semantic correlations among basic emotions are not fully exploited. In this paper, we propose a multi-modal emotion distribution learning framework, named HeLo, aimed at fully exploring the heterogeneity and complementary information in multi-modal emotional data and the label correlation within mixed basic emotions. Specifically, we first adopt cross-attention to effectively fuse the physiological data. Then, an optimal transport (OT)-based heterogeneity mining module is devised to mine the interaction and heterogeneity between the physiological and behavioral representations. To facilitate label correlation learning, we introduce a learnable label embedding optimized by correlation matrix alignment. Finally, the learnable label embeddings and label correlation matrices are integrated with the multi-modal representations through a novel label correlation-driven cross-attention mechanism for accurate emotion distribution learning. Experimental results on two publicly available datasets demonstrate the superiority of our proposed method in emotion distribution learning.
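
One way to read "label embedding optimized by correlation matrix alignment" is sketched below. The abstract does not say how the correlation matrices are computed or which loss aligns them, so this example assumes cosine similarity for both matrices and a mean squared difference as the alignment loss. The function names and the toy annotations are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(rows):
    """Pairwise cosine similarity between all rows."""
    return [[cosine(r, s) for s in rows] for r in rows]

def alignment_loss(label_embeddings, distributions):
    """Mean squared gap between embedding similarity and label co-occurrence similarity."""
    # One column per emotion: its intensity across all annotated samples.
    label_columns = [list(col) for col in zip(*distributions)]
    target = similarity_matrix(label_columns)
    current = similarity_matrix(label_embeddings)
    k = len(target)
    return sum((current[i][j] - target[i][j]) ** 2
               for i in range(k) for j in range(k)) / (k * k)

# Hypothetical annotations: 4 samples over 3 emotions, each row sums to 1.
distributions = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.5, 0.4],
    [0.1, 0.3, 0.6],
]

# Orthogonal starting embeddings encode no correlation between emotions.
initial_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss_before = alignment_loss(initial_embeddings, distributions)

# Embeddings equal to the label columns reproduce the target matrix exactly.
aligned_embeddings = [list(col) for col in zip(*distributions)]
loss_after = alignment_loss(aligned_embeddings, distributions)
print(loss_before, loss_after)
```

In a trainable model the embeddings would be parameters updated by gradient descent on a loss of this kind; the sketch only evaluates the loss at two fixed points to show what alignment means.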

Problem

Research questions and friction points this paper is trying to address.

Mining heterogeneity in multi-modal emotional data
Exploiting semantic correlations across basic emotions
Improving emotion distribution learning accuracy
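
As a small illustration of the task itself, the sketch below represents one emotion distribution and scores a prediction against it with KL divergence, a measure commonly used in label distribution learning. The summary does not state which loss or metrics the paper uses, so the choice of KL divergence, the four-emotion target, and the raw scores are assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); zero-probability target entries contribute nothing."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

# Hypothetical annotated mixture over four basic emotions (sums to 1).
target = [0.6, 0.3, 0.1, 0.0]

# Hypothetical raw model outputs, normalized into a predicted distribution.
predicted = softmax([2.0, 1.0, 0.0, -1.0])

print("predicted distribution:", predicted)
print("KL divergence:", kl_divergence(target, predicted))
```

Unlike single-class recognition, the target here is the whole mixture, so a prediction is judged by how close its distribution is to the annotated one rather than by its top label alone.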

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention for physiological data fusion
Optimal transport-based heterogeneity mining
Learnable label embedding with correlation alignment
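
The cross-attention named in the list above can be sketched as single-head scaled dot-product attention, in which queries from one source attend over keys and values from another. This is a generic illustration, not HeLo's layer: it omits learned projections, multiple heads, and any use of the label correlation matrix, and the toy `queries`, `keys`, and `values` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into attention weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values of another source."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(key dimension).
        scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(w[j] * values[j][c] for j in range(len(values)))
                        for c in range(len(values[0]))])
    return outputs, weights

# Hypothetical tokens: two queries (for example, one modality or label embeddings)
# attending over three key/value tokens from another source.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(queries, keys, values)
print("attention weights:", weights)
print("attended outputs:", outputs)
```

Each output row is a mixture of the value vectors, weighted by how well the query matches each key, which is how information from one source is injected into the representation of another.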