MSF-SER: Enriching Acoustic Modeling with Multi-Granularity Semantics for Speech Emotion Recognition

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing continuous-dimensional speech emotion recognition (SER) methods suffer from two key limitations: (1) reliance solely on global text transcriptions, neglecting the dynamic influence of local prosodic emphasis on emotional semantics; and (2) semantic representations confined to lexical-level features, lacking high-level, interpretable cues. To address these, we propose MSF-SER, a Multi-Granularity Semantic Fusion framework that jointly models local emphasis semantics, BERT-encoded global semantics, and knowledge-enhanced extended semantics—marking the first unified integration of such complementary semantic sources. MSF-SER employs a lightweight Mixture-of-Experts architecture modulated via Feature-wise Linear Modulation (FiLM), coupled with a gated intra-modal fusion mechanism, enabling dynamic capture of fine-grained emotional features and cross-modal synergy. Extensive experiments on MSP-Podcast and IEMOCAP demonstrate significant improvements in valence and arousal prediction accuracy. Ablation studies confirm that multi-granularity semantic fusion enhances both interpretability and performance of continuous SER.

📝 Abstract
Continuous dimensional speech emotion recognition captures affective variation along valence, arousal, and dominance, providing finer-grained representations than categorical approaches. Yet most multimodal methods rely solely on global transcripts, leading to two limitations: (1) all words are treated equally, overlooking that emphasis on different parts of a sentence can shift emotional meaning; (2) only surface lexical content is represented, lacking higher-level interpretive cues. To overcome these issues, we propose MSF-SER (Multi-granularity Semantic Fusion for Speech Emotion Recognition), which augments acoustic features with three complementary levels of textual semantics--Local Emphasized Semantics (LES), Global Semantics (GS), and Extended Semantics (ES). These are integrated via an intra-modal gated fusion and a cross-modal FiLM-modulated lightweight Mixture-of-Experts (FM-MOE). Experiments on MSP-Podcast and IEMOCAP show that MSF-SER consistently improves dimensional prediction, demonstrating the effectiveness of enriched semantic fusion for SER.
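The abstract's core cross-modal component, the FiLM-modulated lightweight Mixture-of-Experts (FM-MOE), can be illustrated with a minimal NumPy sketch. This is a hedged reconstruction from the abstract's description only, not the authors' implementation: the semantic vector produces per-feature scale (gamma) and shift (beta) terms that modulate the acoustic features, after which a softmax router mixes a few lightweight experts. All weight names, dimensions, and the linear-expert choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(x, cond, Wg, Wb):
    # FiLM: the conditioning vector yields per-feature scale/shift
    gamma = cond @ Wg
    beta = cond @ Wb
    return gamma * x + beta

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def film_moe(acoustic, semantic, experts, Wrouter, Wg, Wb):
    # Hypothetical FM-MOE: semantics modulate acoustics via FiLM,
    # then a router produces mixture weights over lightweight experts.
    h = film(acoustic, semantic, Wg, Wb)         # cross-modal modulation
    weights = softmax(h @ Wrouter)               # (num_experts,) routing scores
    outs = np.stack([W @ h for W in experts])    # each expert is a linear map here
    return weights @ outs                        # weighted expert combination

# Toy dimensions; the real model's sizes are not given in this summary.
d, d_sem, n_exp = 8, 4, 3
acoustic = rng.normal(size=d)
semantic = rng.normal(size=d_sem)
experts = [rng.normal(size=(d, d)) for _ in range(n_exp)]
Wrouter = rng.normal(size=(d, n_exp))
Wg = rng.normal(size=(d_sem, d))
Wb = rng.normal(size=(d_sem, d))

y = film_moe(acoustic, semantic, experts, Wrouter, Wg, Wb)
print(y.shape)  # (8,)
```

The key design point conveyed by the abstract is that semantics act as a modulator of acoustic features rather than being concatenated with them, which is what FiLM provides.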
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of global transcripts in emotion recognition
Enriches acoustic features with multi-granularity semantic representations
Improves dimensional emotion prediction through semantic fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augments acoustic features with multi-granularity textual semantics
Integrates semantics via intra-modal gated fusion
Uses cross-modal FiLM-modulated lightweight Mixture-of-Experts
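The intra-modal gated fusion mentioned above can be sketched in the same style. This is an assumption-laden illustration of one common gating scheme, not the paper's exact formulation: each of the three semantic streams (LES, GS, ES) is passed through a sigmoid gate derived from itself, and the gated streams are summed. The gate parameterization is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(streams, gate_weights):
    # Hypothetical gated intra-modal fusion: each semantic stream
    # (e.g. LES, GS, ES) contributes through its own per-feature gate.
    fused = np.zeros_like(streams[0])
    for s, W in zip(streams, gate_weights):
        g = sigmoid(W @ s)   # gate values in (0, 1), computed from the stream
        fused += g * s       # gated contribution
    return fused

# Toy usage with three semantic streams of equal dimension.
d = 6
rng = np.random.default_rng(1)
les, gs, es = (rng.normal(size=d) for _ in range(3))
gates = [rng.normal(size=(d, d)) for _ in range(3)]
fused = gated_fusion([les, gs, es], gates)
print(fused.shape)  # (6,)
```

Gating lets the model down-weight a semantic granularity that is uninformative for a given utterance, which matches the abstract's motivation that emphasis can shift emotional meaning.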
Haoxun Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Yuqing Sun
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Hanlei Shi
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Yu Liu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Leyuan Qu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Speech Representation Learning, Multi-modal Learning and Affective Computing
Taihao Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences