Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation

📅 2024-08-02
🏛️ ACM Multimedia
📈 Citations: 4
Influential: 0
🤖 AI Summary
In multimodal emotion recognition in conversations (MERC), existing methods struggle to jointly model speaker identity and both short- and long-range contextual dependencies, while fixed graph structures induce redundancy and over-smoothing. To address these issues, this paper proposes a dynamic hypergraph modeling framework based on a variational hypergraph autoencoder (VHGAE). Instead of relying on predefined fully connected graphs, our approach employs variational inference to adaptively learn semantic-aware, high-order conversational relationships; contrastive learning is further introduced to mitigate uncertainty in feature reconstruction. Additionally, we design a cross-modal aligned multimodal graph neural network to enable fine-grained inter-modal fusion. Extensive experiments demonstrate that our method achieves significant improvements over state-of-the-art approaches on the IEMOCAP and MELD benchmarks. The source code is publicly available.

📝 Abstract
Multimodal emotion recognition in conversation (MERC) seeks to identify the speakers' emotions expressed in each utterance, offering significant potential across diverse fields. The challenge of MERC lies in balancing speaker modeling with context modeling, encompassing both long-distance and short-distance contexts, and in handling the complexity of multimodal information fusion. Recent research adopts graph-based methods to model intricate conversational relationships. However, most of these methods link all utterances through a fixed fully connected structure and rely on convolution to interpret complex context, which inherently heightens redundancy in contextual messages and causes excessive smoothing of the graph network, particularly in long-distance conversations. To address this issue, we propose a framework that dynamically adjusts hypergraph connections via a variational hypergraph autoencoder (VHGAE) and employs contrastive learning to mitigate uncertainty during the reconstruction process. Experimental results demonstrate the effectiveness of our proposal against state-of-the-art methods on the IEMOCAP and MELD datasets. We release the code to support the reproducibility of this work at https://github.com/yzjred/-HAUCL.
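The core idea of "dynamically adjusting hypergraph connections via variational inference" can be pictured as sampling a soft node-to-hyperedge incidence matrix with a Gumbel-softmax relaxation instead of fixing a fully connected graph. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function names, the random projection `W`, and all shapes are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0, rng=rng):
    """Relaxed discrete sampling: each row becomes a soft distribution
    over hyperedges, so edge membership stays differentiable in training."""
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)

def sample_incidence(node_feats, n_hyperedges, tau=0.5):
    """Project utterance features to hyperedge logits (hypothetical linear
    map W) and sample a soft incidence matrix H of shape (nodes, hyperedges)."""
    n, d = node_feats.shape
    W = rng.standard_normal((d, n_hyperedges)) * 0.1  # stand-in for a learned layer
    logits = node_feats @ W
    return gumbel_softmax(logits, tau)

feats = rng.standard_normal((6, 8))   # 6 utterances, 8-dim multimodal features
H = sample_incidence(feats, n_hyperedges=3)
assert H.shape == (6, 3)
assert np.allclose(H.sum(axis=1), 1.0)  # each utterance's memberships sum to 1
```

A low temperature `tau` pushes each row toward a near-one-hot assignment, which is how a relaxation like this can approximate discrete hyperedge selection while remaining trainable end to end.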
Problem

Research questions and friction points this paper is trying to address.

Balancing speaker and context modeling in MERC
Reducing redundancy in long-distance conversation contexts
Improving multimodal fusion complexity in emotion recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic hypergraph connections via VHGAE
Contrastive learning reduces reconstruction uncertainty
Outperforms state-of-the-art on IEMOCAP, MELD
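The contrastive objective used to reduce reconstruction uncertainty can be sketched with a standard InfoNCE-style loss: a reconstructed utterance embedding is pulled toward its original (the positive) and pushed away from other utterances (the negatives). This is a generic illustration under assumed shapes, not the paper's exact loss.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: low when the anchor is most similar to its positive,
    high when it is closer to the negatives."""
    def sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(sim(anchor, positive) / tau)
    neg = sum(np.exp(sim(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(1)
z = rng.standard_normal(16)                      # original utterance embedding
others = [rng.standard_normal(16) for _ in range(4)]  # other utterances as negatives

good_rec = z + 0.01 * rng.standard_normal(16)    # faithful reconstruction
bad_rec = -z                                     # maximally misaligned reconstruction
assert info_nce(good_rec, z, others) < info_nce(bad_rec, z, others)
```

Minimizing such a loss penalizes reconstructions that drift toward unrelated utterances, which is one plausible reading of how contrastive learning "mitigates uncertainty factors during the reconstruction process".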