🤖 AI Summary
Existing continuous-dimensional speech emotion recognition (SER) methods suffer from two key limitations: (1) reliance solely on global text transcriptions, neglecting the dynamic influence of local prosodic emphasis on emotional semantics; and (2) semantic representations confined to lexical-level features, lacking high-level, interpretable cues. To address these limitations, we propose MSF-SER, a Multi-Granularity Semantic Fusion framework that jointly models local emphasis semantics, BERT-encoded global semantics, and knowledge-enhanced extended semantics, marking the first unified integration of these complementary semantic sources. MSF-SER employs a lightweight Mixture-of-Experts architecture modulated via Feature-wise Linear Modulation (FiLM), coupled with a gated intra-modal fusion mechanism, enabling dynamic capture of fine-grained emotional cues and cross-modal synergy. Extensive experiments on MSP-Podcast and IEMOCAP demonstrate significant improvements in valence and arousal prediction accuracy. Ablation studies confirm that multi-granularity semantic fusion enhances both the interpretability and the performance of continuous SER.
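To make the summary concrete, below is a minimal PyTorch sketch of how a FiLM-modulated lightweight Mixture-of-Experts could condition acoustic experts on fused text semantics. The class and argument names, layer sizes, number of experts, and soft routing scheme are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class FiLMMoE(nn.Module):
    """Illustrative FiLM-modulated lightweight Mixture-of-Experts (sketch).

    Acoustic features are soft-routed through a few small expert MLPs, and the
    mixed representation is scaled and shifted (FiLM) by parameters predicted
    from the fused text semantics before a dimensional regression head.
    All dimensions, expert counts, and routing are assumptions.
    """

    def __init__(self, d_audio=768, d_text=768, d_hidden=256, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(
                nn.Linear(d_audio, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_hidden),
            ) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_audio, n_experts)   # soft expert weights
        self.film = nn.Linear(d_text, 2 * d_hidden)   # predicts (gamma, beta)
        self.head = nn.Linear(d_hidden, 3)            # valence, arousal, dominance

    def forward(self, audio_feat, text_feat):
        # audio_feat: (B, d_audio) utterance-level acoustic embedding
        # text_feat:  (B, d_text)  fused multi-granularity text embedding
        gamma, beta = self.film(text_feat).chunk(2, dim=-1)
        weights = torch.softmax(self.router(audio_feat), dim=-1)            # (B, E)
        expert_out = torch.stack([e(audio_feat) for e in self.experts], 1)  # (B, E, H)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)             # (B, H)
        modulated = gamma * mixed + beta                                    # text-conditioned FiLM
        return self.head(modulated)                                         # (B, 3)
```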
📝 Abstract
Continuous dimensional speech emotion recognition captures affective variation along valence, arousal, and dominance, providing finer-grained representations than categorical approaches. Yet most multimodal methods rely solely on global transcripts, leading to two limitations: (1) all words are treated equally, overlooking that emphasis on different parts of a sentence can shift emotional meaning; (2) only surface lexical content is represented, lacking higher-level interpretive cues. To overcome these issues, we propose MSF-SER (Multi-granularity Semantic Fusion for Speech Emotion Recognition), which augments acoustic features with three complementary levels of textual semantics: Local Emphasized Semantics (LES), Global Semantics (GS), and Extended Semantics (ES). These are integrated via an intra-modal gated fusion and a cross-modal FiLM-modulated lightweight Mixture-of-Experts (FM-MOE). Experiments on MSP-Podcast and IEMOCAP show that MSF-SER consistently improves dimensional prediction, demonstrating the effectiveness of enriched semantic fusion for SER.
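As a rough sketch of the intra-modal gated fusion described above, the three semantic streams (LES, GS, ES) could be combined with a learned softmax gate as follows; the names and dimensions are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn


class GatedSemanticFusion(nn.Module):
    """Illustrative intra-modal gated fusion of the three text-semantic
    streams (LES, GS, ES): a softmax gate predicts per-stream weights and
    the streams are combined as a convex sum. Sizes are assumptions."""

    def __init__(self, d_model=768):
        super().__init__()
        self.gate = nn.Linear(3 * d_model, 3)

    def forward(self, les, gs, es):
        # les, gs, es: (B, d_model) utterance-level semantic embeddings
        stacked = torch.stack([les, gs, es], dim=1)                          # (B, 3, d)
        alpha = torch.softmax(self.gate(torch.cat([les, gs, es], -1)), -1)   # (B, 3)
        return (alpha.unsqueeze(-1) * stacked).sum(dim=1)                    # fused (B, d)
```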