🤖 AI Summary
Emotion Recognition in Conversations (ERC) is challenging because discriminative multimodal evidence is sparse, localized, and often asynchronous across modalities. To address this, we propose a multimodal ERC framework centered on "emotion hotspots": (1) local emotion-critical segments are first identified within the textual, acoustic, and visual modalities; (2) a hotspot-gated fusion mechanism adaptively weights these local hotspots against global contextual representations; (3) a routing-based hybrid aligner performs fine-grained cross-modal alignment; and (4) a dialogue-structure graph models inter-utterance dependencies. The framework combines local-global feature modeling, graph neural networks, dynamic attention, and gating. On standard ERC benchmarks it outperforms strong baselines, and ablation studies confirm the contribution of each component. Overall, the framework achieves robust, interpretable, fine-grained emotion recognition by jointly exploiting modality-specific saliency and structured conversational context.
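The hotspot-gated fusion step described in (2) can be sketched as a learned, per-dimension gate that blends a local hotspot feature with the global context feature. The single linear-layer gate below is an illustrative assumption, not the paper's exact formulation; all variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hotspot_gated_fusion(local_feat, global_feat, W, b):
    """Blend a local hotspot feature with the global context feature.

    A gate (here: one linear layer + sigmoid, a simplifying assumption)
    decides, per dimension, how much weight the local hotspot receives
    versus the global representation.
    """
    gate_in = np.concatenate([local_feat, global_feat])  # [2d]
    g = sigmoid(W @ gate_in + b)                         # [d], values in (0, 1)
    return g * local_feat + (1.0 - g) * global_feat      # convex blend per dim

# Toy example with feature dimension d = 4
rng = np.random.default_rng(0)
d = 4
local_feat = rng.normal(size=d)
global_feat = rng.normal(size=d)
W = rng.normal(size=(d, 2 * d)) * 0.1
b = np.zeros(d)
fused = hotspot_gated_fusion(local_feat, global_feat, W, b)
```

Because the gate output lies in (0, 1), each fused dimension is a convex combination of the local and global values, so salient hotspot evidence can dominate without discarding context.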
📄 Abstract
Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion (HGF), and aligns modalities using a routed Mixture-of-Aligners (MoA); a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results support a hotspot-centric view of modality fusion that can inform future multimodal learning.
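The routed Mixture-of-Aligners can be sketched as a lightweight router that scores a set of alignment experts for each query feature and combines their outputs as a softmax-weighted mixture. The linear router, the linear-map experts, and all names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route_mixture_of_aligners(query, aligners, router_W):
    """Score each alignment expert for the query, then mix their outputs.

    Each expert maps the query into a shared cross-modal space; the
    router's softmax weights decide how much each expert contributes.
    """
    scores = router_W @ query                        # one score per expert
    weights = softmax(scores)                        # sums to 1
    outputs = np.stack([f(query) for f in aligners]) # [n_experts, d]
    return weights @ outputs, weights                # weighted mixture, [d]

# Toy experts: fixed linear maps standing in for cross-modal aligners
rng = np.random.default_rng(1)
d, n_experts = 4, 3
experts = [lambda q, A=rng.normal(size=(d, d)): A @ q for _ in range(n_experts)]
router_W = rng.normal(size=(n_experts, d))
query = rng.normal(size=d)
aligned, weights = route_mixture_of_aligners(query, experts, router_W)
```

A soft mixture is shown here for simplicity; a top-k routing variant would instead zero out all but the highest-scoring experts before renormalizing.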