Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak cross-modal interaction and imbalanced modality contributions in multimodal emotion recognition (MER), this paper proposes a dynamic enhancement and heterogeneous graph co-modeling framework. Methodologically, it introduces: (1) a modality-specific dynamic feature enhancement module that adaptively calibrates representation strength per modality; (2) a heterogeneous cross-modal graph structure explicitly encoding asymmetric semantic relationships among text, audio, and visual modalities; and (3) a cross-modal cross-attention mechanism to improve fine-grained semantic alignment and emotion reasoning. The framework is trained end-to-end, incorporating strategies to mitigate class imbalance. Extensive experiments on MELD and IEMOCAP demonstrate significant improvements over state-of-the-art methods in both accuracy and weighted F1-score, validating the effectiveness and robustness of the proposed cross-modal co-modeling approach.
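The summary describes the dynamic enhancement module only at a high level, so below is a minimal PyTorch sketch of one plausible design: a learned channel-wise gate that rescales each modality's features. The class name DynamicEnhance, the gating layout, and all dimensions are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DynamicEnhance(nn.Module):
    """Hypothetical modality-specific dynamic enhancement:
    learns a per-channel gate that rescales one modality's features."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
            nn.Sigmoid(),  # gate in (0, 1) calibrates representation strength
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) utterance-level features for one modality
        return x * self.gate(x)

# One independent enhancer per modality, mirroring the summary's
# "modality-specific" calibration.
enhancers = nn.ModuleDict({
    m: DynamicEnhance(dim=256) for m in ("text", "audio", "visual")
})
```

A sigmoid gate is one straightforward way to adaptively strengthen or suppress each modality's representation per channel, which is what "adaptively calibrates representation strength per modality" suggests.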

📝 Abstract
Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. A cross-attention fusion mechanism further aligns multimodal cues for robust emotion inference. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
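To make the "heterogeneous cross-modal graphs" concrete, here is an illustrative layer in the same spirit: one text, one audio, and one visual node per utterance, with a separate learned projection for each directed modality pair so that, for example, text→audio and audio→text messages are scored differently (the asymmetry the summary highlights). Everything below is an assumption for illustration, not Sync-TVA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroCrossModalGraph(nn.Module):
    """Illustrative heterogeneous graph layer over {text, audio, visual}
    nodes. Each directed modality pair gets its own message projection,
    so the two directions of an edge are modeled asymmetrically."""
    def __init__(self, dim: int):
        super().__init__()
        self.mods = ("text", "audio", "visual")
        # one projection per directed edge type, e.g. "text->audio"
        self.msg = nn.ModuleDict({
            f"{s}->{d}": nn.Linear(dim, dim)
            for s in self.mods for d in self.mods if s != d
        })
        self.attn = nn.Linear(2 * dim, 1)  # scores a (target, message) pair

    def forward(self, feats: dict) -> dict:
        # feats[m]: (batch, dim) node features for modality m
        out = {}
        for d in self.mods:
            msgs, scores = [], []
            for s in self.mods:
                if s == d:
                    continue
                m = self.msg[f"{s}->{d}"](feats[s])            # (batch, dim)
                e = self.attn(torch.cat([feats[d], m], -1))    # (batch, 1)
                msgs.append(m)
                scores.append(e)
            alpha = F.softmax(torch.cat(scores, -1), dim=-1)   # (batch, 2)
            agg = sum(a.unsqueeze(-1) * m
                      for a, m in zip(alpha.unbind(-1), msgs))
            out[d] = feats[d] + agg  # residual update of the target node
        return out
```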
Problem

Research questions and friction points this paper is trying to address.

Limited cross-modal interaction in emotion recognition.
Imbalanced contributions across different modalities.
Need for robust multimodal emotion inference.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-attention framework for multimodal emotion recognition.
Dynamic enhancement module for each modality.
Cross-attention fusion mechanism that aligns multimodal cues (see the sketch after this list).
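As referenced above, here is a compact sketch of a cross-attention fusion stage, assuming standard multi-head attention with text tokens as queries over concatenated audio and visual tokens. Sync-TVA's actual query/key layout and head count are not specified in this summary, so treat the design and the 7-class head (e.g., MELD's seven emotions) as illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion: text queries attend over audio+visual tokens,
    aligning multimodal cues before emotion classification."""
    def __init__(self, dim: int = 256, heads: int = 4, classes: int = 7):
        super().__init__()
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classify = nn.Linear(2 * dim, classes)

    def forward(self, text, audio, visual):
        # text/audio/visual: (batch, seq, dim) token sequences per modality
        context = torch.cat([audio, visual], dim=1)      # keys/values
        aligned, _ = self.xattn(text, context, context)  # text as queries
        pooled = torch.cat([text.mean(1), aligned.mean(1)], dim=-1)
        return self.classify(pooled)  # emotion logits
```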
Zeyu Deng
James Watt School of Engineering, University of Glasgow, Glasgow G12 8QQ, U.K.
Yanhui Lu
School of Engineering Mathematics and Technology, University of Bristol, U.K.
Jiashu Liao
School of Computing Science, University of Glasgow, U.K.
Shuang Wu
Department of Civil, Environmental & Geomatic Engineering, University College London (UCL), U.K.
Chongfeng Wei
Associate Professor, University of Glasgow
Human-Robot Interaction, Decision Making and Control, Nonlinear Mechanical Systems