🤖 AI Summary
Emotion Recognition in Conversations (ERC) is challenging because discriminative multimodal evidence is sparse, localized, and often asynchronous across modalities. To address this, we propose a multimodal ERC framework centered on "emotion hotspots": (1) local emotion-critical segments are first identified within the textual, acoustic, and visual modalities; (2) a hotspot-gated fusion mechanism adaptively weights these local hotspots against global contextual representations; (3) a routing-based hybrid aligner performs fine-grained cross-modal alignment; and (4) a dialogue-structure graph models inter-utterance dependencies. The framework combines local-global feature modeling, graph neural networks, dynamic attention, and gating. On standard ERC benchmarks it outperforms strong baselines, and ablation studies confirm the contribution of each component. Overall, the framework achieves robust, interpretable, fine-grained emotion recognition by jointly exploiting modality-specific saliency and structured conversational context.
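The hotspot-gated fusion step described in (2) can be sketched as a learned, per-dimension gate that blends a local hotspot feature with the global context feature. The single linear-layer gate below is an illustrative assumption, not the paper's exact formulation; all variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hotspot_gated_fusion(local_feat, global_feat, W, b):
    """Blend a local hotspot feature with the global context feature.

    A gate (here: one linear layer + sigmoid, a simplifying assumption)
    decides, per dimension, how much weight the local hotspot receives
    versus the global representation.
    """
    gate_in = np.concatenate([local_feat, global_feat])  # [2d]
    g = sigmoid(W @ gate_in + b)                         # [d], values in (0, 1)
    return g * local_feat + (1.0 - g) * global_feat      # convex blend per dim

# Toy example with feature dimension d = 4
rng = np.random.default_rng(0)
d = 4
local_feat = rng.normal(size=d)
global_feat = rng.normal(size=d)
W = rng.normal(size=(d, 2 * d)) * 0.1
b = np.zeros(d)
fused = hotspot_gated_fusion(local_feat, global_feat, W, b)
```

Because the gate output lies in (0, 1), each fused dimension is a convex combination of the local and global values, so salient hotspot evidence can dominate without discarding context.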
📄 Abstract
Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion (HGF), and aligns modalities using a routed Mixture-of-Aligners (MoA); a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results support a hotspot-centric view of modality fusion that can inform future multimodal learning.
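The routed Mixture-of-Aligners can be sketched as a lightweight router that scores a set of alignment experts for each query feature and combines their outputs as a softmax-weighted mixture. The linear router, the linear-map experts, and all names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route_mixture_of_aligners(query, aligners, router_W):
    """Score each alignment expert for the query, then mix their outputs.

    Each expert maps the query into a shared cross-modal space; the
    router's softmax weights decide how much each expert contributes.
    """
    scores = router_W @ query                        # one score per expert
    weights = softmax(scores)                        # sums to 1
    outputs = np.stack([f(query) for f in aligners]) # [n_experts, d]
    return weights @ outputs, weights                # weighted mixture, [d]

# Toy experts: fixed linear maps standing in for cross-modal aligners
rng = np.random.default_rng(1)
d, n_experts = 4, 3
experts = [lambda q, A=rng.normal(size=(d, d)): A @ q for _ in range(n_experts)]
router_W = rng.normal(size=(n_experts, d))
query = rng.normal(size=d)
aligned, weights = route_mixture_of_aligners(query, experts, router_W)
```

A soft mixture is shown here for simplicity; a top-k routing variant would instead zero out all but the highest-scoring experts before renormalizing.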