Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations

📅 2025-10-07
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Emotion Recognition in Conversations (ERC) faces challenges stemming from sparse, localized, and asynchronous multimodal evidence. To address these, we propose a multimodal ERC framework centered on "emotion hotspots": (1) local emotion-critical segments are first identified within textual, acoustic, and visual modalities; (2) a hotspot-gated fusion mechanism adaptively weights local hotspots against global contextual representations; (3) a routing-based hybrid aligner enables fine-grained cross-modal alignment; and (4) a dialogue-structure graph models inter-utterance dependencies. The method integrates local-global feature modeling, graph neural networks, dynamic attention, and gating mechanisms. Evaluated on standard benchmarks, our approach significantly outperforms strong baselines. Ablation studies confirm the effectiveness of each component. Overall, the framework achieves robust, interpretable, fine-grained emotion recognition by jointly leveraging modality-specific saliency and structured conversational context.
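The hotspot-gated fusion step can be made concrete with a small sketch: a learned gate looks at both the pooled local hotspot features and the global context features and blends them per dimension. The module name, feature dimensionality, and the convex-combination form below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HotspotGatedFusion(nn.Module):
    """Blends local hotspot evidence with global context via a learned gate."""
    def __init__(self, dim: int):
        super().__init__()
        # Gate conditioned on both views; emits per-dimension weights in (0, 1).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, h_local: torch.Tensor, h_global: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([h_local, h_global], dim=-1))
        # g -> 1 favors the local hotspot; g -> 0 falls back to global context.
        return g * h_local + (1.0 - g) * h_global

fusion = HotspotGatedFusion(dim=256)
fused = fusion(torch.randn(8, 256), torch.randn(8, 256))  # (batch, dim)
```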

📝 Abstract
Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion, and aligns modalities using a routed Mixture-of-Aligners; a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.
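The per-utterance hotspot detection described above can be pictured as scoring every token or frame inside a modality and pooling only the most salient ones. The top-k attention pooling below is an assumed mechanism for illustration; the paper's detector may differ.

```python
import torch
import torch.nn as nn

class HotspotScorer(nn.Module):
    """Scores each time step and pools the top-k as the local hotspot vector."""
    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)  # saliency score per token/frame

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, dim) features from one modality encoder
        logits = self.score(seq).squeeze(-1)          # (batch, time)
        topk = logits.topk(self.k, dim=-1)            # most salient positions
        weights = torch.softmax(topk.values, dim=-1)  # renormalize over top-k
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, seq.size(-1))
        picked = seq.gather(1, idx)                   # (batch, k, dim)
        return (weights.unsqueeze(-1) * picked).sum(dim=1)  # (batch, dim)

scorer = HotspotScorer(dim=256, k=4)
h_local = scorer(torch.randn(8, 50, 256))  # one hotspot vector per utterance
```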
Problem

Research questions and friction points this paper is trying to address.

Detecting localized emotion hotspots across text, audio, and video modalities
Fusing local hotspots with global features using gated mechanisms
Aligning asynchronous multimodal data through cross-modal routing and graphs
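As a sketch of the graph side of the last point, one can build an utterance-level adjacency from temporal proximity and speaker identity, then run simple mean-aggregation message passing over it; the windowing heuristic and edge types here are assumptions, not the paper's construction.

```python
import torch
import torch.nn as nn

def dialogue_adjacency(speakers: list[int], window: int = 2) -> torch.Tensor:
    """Links utterances within a temporal window plus same-speaker pairs
    (assumed edge heuristics), then row-normalizes for mean aggregation."""
    n = len(speakers)
    adj = torch.eye(n)
    for i in range(n):
        for j in range(max(0, i - window), i):  # temporal neighbors
            adj[i, j] = adj[j, i] = 1.0
        for j in range(n):                      # same-speaker links
            if speakers[i] == speakers[j]:
                adj[i, j] = 1.0
    return adj / adj.sum(dim=-1, keepdim=True)

class GraphLayer(nn.Module):
    """One round of message passing: average neighbors, then transform."""
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.lin(adj @ x))  # x: (n_utterances, dim)

adj = dialogue_adjacency([0, 1, 0, 1, 0])
out = GraphLayer(256)(torch.randn(5, 256), adj)
```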
Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects multimodal emotion hotspots per utterance
Fuses local-global features via Hotspot-Gated Fusion
Aligns modalities using routed Mixture-of-Aligners
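The routed Mixture-of-Aligners in the last point can be sketched as a router that softly weights several cross-attention "aligner" experts per utterance. The expert count, per-query routing, and use of standard multi-head attention are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MixtureOfAligners(nn.Module):
    """Soft-routes a query over several cross-attention aligner experts."""
    def __init__(self, dim: int, n_experts: int = 3, n_heads: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_experts)
        )

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query: (batch, 1, dim) target-modality token
        # context: (batch, time, dim) sequence from another modality
        probs = torch.softmax(self.router(query), dim=-1)  # (batch, 1, n_experts)
        outs = torch.stack(
            [expert(query, context, context)[0] for expert in self.experts],
            dim=-1,
        )  # (batch, 1, dim, n_experts)
        return (outs * probs.unsqueeze(-2)).sum(dim=-1)    # weighted expert mix

moa = MixtureOfAligners(dim=256)
aligned = moa(torch.randn(8, 1, 256), torch.randn(8, 40, 256))  # text <- audio
```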
Yu Liu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Hanlei Shi
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Haoxun Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Yuqing Sun
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Yuxuan Ding
Qualcomm AI Research
Vision-and-Language · Large Language Model · Efficient AI
Linlin Gong
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Leyuan Qu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Speech Representation Learning · Multi-modal Learning and Affective Computing
Taihao Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences