🤖 AI Summary
Existing continuous-dimensional speech emotion recognition (SER) methods suffer from two key limitations: (1) reliance solely on global text transcriptions, neglecting the dynamic influence of local prosodic emphasis on emotional semantics; and (2) semantic representations confined to lexical-level features, lacking high-level, interpretable cues. To address these limitations, we propose MSF-SER, a Multi-Granularity Semantic Fusion framework that jointly models local emphasis semantics, BERT-encoded global semantics, and knowledge-enhanced extended semantics, marking the first unified integration of these complementary semantic sources. MSF-SER employs a lightweight Mixture-of-Experts architecture modulated via Feature-wise Linear Modulation (FiLM), coupled with a gated intra-modal fusion mechanism, enabling dynamic capture of fine-grained emotional cues and cross-modal synergy. Extensive experiments on MSP-Podcast and IEMOCAP demonstrate significant improvements in valence and arousal prediction accuracy. Ablation studies confirm that multi-granularity semantic fusion enhances both the interpretability and the performance of continuous SER.
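To make the summary concrete, below is a minimal PyTorch sketch of how a FiLM-modulated lightweight Mixture-of-Experts could condition acoustic experts on fused text semantics. The class and argument names, layer sizes, number of experts, and soft routing scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class FiLMMoE(nn.Module):
    """Illustrative FiLM-modulated lightweight Mixture-of-Experts (sketch).

    Acoustic features are soft-routed through a few small expert MLPs, and the
    mixed representation is scaled and shifted (FiLM) by parameters predicted
    from the fused text semantics before a dimensional regression head.
    All dimensions, expert counts, and routing are assumptions.
    """

    def __init__(self, d_audio=768, d_text=768, d_hidden=256, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(
                nn.Linear(d_audio, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_hidden),
            ) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_audio, n_experts)   # soft expert weights
        self.film = nn.Linear(d_text, 2 * d_hidden)   # predicts (gamma, beta)
        self.head = nn.Linear(d_hidden, 3)            # valence, arousal, dominance

    def forward(self, audio_feat, text_feat):
        # audio_feat: (B, d_audio) utterance-level acoustic embedding
        # text_feat:  (B, d_text)  fused multi-granularity text embedding
        gamma, beta = self.film(text_feat).chunk(2, dim=-1)
        weights = torch.softmax(self.router(audio_feat), dim=-1)            # (B, E)
        expert_out = torch.stack([e(audio_feat) for e in self.experts], 1)  # (B, E, H)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)             # (B, H)
        modulated = gamma * mixed + beta                                    # text-conditioned FiLM
        return self.head(modulated)                                         # (B, 3)
```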
📝 Abstract
Continuous dimensional speech emotion recognition captures affective variation along valence, arousal, and dominance, providing finer-grained representations than categorical approaches. Yet most multimodal methods rely solely on global transcripts, leading to two limitations: (1) all words are treated equally, overlooking that emphasis on different parts of a sentence can shift emotional meaning; (2) only surface lexical content is represented, lacking higher-level interpretive cues. To overcome these issues, we propose MSF-SER (Multi-granularity Semantic Fusion for Speech Emotion Recognition), which augments acoustic features with three complementary levels of textual semantics: Local Emphasized Semantics (LES), Global Semantics (GS), and Extended Semantics (ES). These are integrated via an intra-modal gated fusion and a cross-modal FiLM-modulated lightweight Mixture-of-Experts (FM-MOE). Experiments on MSP-Podcast and IEMOCAP show that MSF-SER consistently improves dimensional prediction, demonstrating the effectiveness of enriched semantic fusion for SER.
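As a rough sketch of the intra-modal gated fusion described above, the three semantic streams (LES, GS, ES) could be combined with a learned softmax gate as follows; the names and dimensions are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn


class GatedSemanticFusion(nn.Module):
    """Illustrative intra-modal gated fusion of the three text-semantic
    streams (LES, GS, ES): a softmax gate predicts per-stream weights and
    the streams are combined as a convex sum. Sizes are assumptions."""

    def __init__(self, d_model=768):
        super().__init__()
        self.gate = nn.Linear(3 * d_model, 3)

    def forward(self, les, gs, es):
        # les, gs, es: (B, d_model) utterance-level semantic embeddings
        stacked = torch.stack([les, gs, es], dim=1)                          # (B, 3, d)
        alpha = torch.softmax(self.gate(torch.cat([les, gs, es], -1)), -1)   # (B, 3)
        return (alpha.unsqueeze(-1) * stacked).sum(dim=1)                    # fused (B, d)
```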