🤖 AI Summary
To address the low robustness of speech emotion recognition (SER) in natural, spontaneous speech, where emotional expressions are subtle and noise interference is strong, this paper proposes a multimodal graph attention fusion framework. Methodologically, we first systematically validate the effectiveness of F0 quantization for SER in naturalistic scenarios; second, we design a graph attention network (GAT)-driven cross-modal fusion mechanism that integrates acoustic, ASR-derived textual, and prosodic-spectral features to strengthen the modeling of sparse emotional cues; third, we incorporate a pretrained audio tagging model (PANNs) and a multi-model ensemble to improve generalization. On the official INTERSPEECH 2025 test set, our approach achieves a Macro F1 score of 39.79% (42.20% on the validation set). Ablation studies confirm that GAT-based fusion yields significant performance gains over conventional fusion methods.
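As a rough illustration of the GAT-driven fusion idea (a minimal sketch, not the authors' architecture), the snippet below treats each modality embedding, acoustic, textual, and prosody-spectral, as a node in a fully connected graph and applies one graph attention layer in plain NumPy. All dimensions, weights, and the single-head setup are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gat_fuse(nodes, W, a, alpha=0.2):
    """One single-head GAT layer over a fully connected modality graph.

    nodes: (N, d_in) modality embeddings; W: (d_in, d_out) shared
    projection; a: (2*d_out,) attention vector; alpha: LeakyReLU slope.
    Returns attention-weighted node features of shape (N, d_out).
    """
    h = nodes @ W                       # project each node: (N, d_out)
    N = h.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            # attention logit e_ij = LeakyReLU(a^T [h_i || h_j])
            z = np.concatenate([h[i], h[j]]) @ a
            e[i, j] = z if z > 0 else alpha * z
    # row-wise softmax over neighbors (numerically stabilized)
    attn = np.exp(e - e.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ h                     # fused per-node representations

# three modality nodes: acoustic, text, prosody-spectral (dims illustrative)
nodes = rng.standard_normal((3, 16))
W = rng.standard_normal((16, 8))
a = rng.standard_normal(16)
fused = gat_fuse(nodes, W, a)
```

In a full system, the fused node features would typically be pooled and passed to an emotion classifier; real implementations usually add multi-head attention and learned parameters.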
📝 Abstract
Training SER models on natural, spontaneous speech is especially challenging due to the subtlety of emotional expression and the unpredictable nature of real-world audio. In this paper, we present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, focusing on categorical emotion recognition. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues. In particular, we investigate the effectiveness of fundamental frequency (F0) quantization and the use of a pretrained audio tagging model. We also employ a model ensemble to improve robustness. On the official test set, our system achieves a Macro F1-score of 39.79% (42.20% on validation). These results underscore the potential of our methods, and our analysis of fusion techniques confirms the effectiveness of Graph Attention Networks. Our source code is publicly available.
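For context, F0 quantization typically maps continuous pitch values onto a small discrete vocabulary that a model can consume as tokens. A minimal sketch, assuming log-scale uniform bins and a reserved token for unvoiced frames (the bin count and frequency range here are illustrative, not the paper's settings):

```python
import numpy as np

def quantize_f0(f0_hz, n_bins=32, f_min=50.0, f_max=500.0):
    """Map continuous F0 values (Hz) to discrete token IDs.

    Voiced frames are clipped to [f_min, f_max] and binned uniformly in
    log-frequency into tokens 1..n_bins; unvoiced frames (f0 <= 0) get
    the reserved token 0.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    tokens = np.zeros_like(f0, dtype=int)   # default: unvoiced token 0
    voiced = f0 > 0
    log_f0 = np.log(np.clip(f0[voiced], f_min, f_max))
    edges = np.linspace(np.log(f_min), np.log(f_max), n_bins + 1)
    # digitize against interior edges -> bin index 0..n_bins-1, shift by 1
    tokens[voiced] = np.digitize(log_f0, edges[1:-1]) + 1
    return tokens

# e.g. an unvoiced frame, the lowest pitch, and the highest pitch
print(quantize_f0([0.0, 50.0, 500.0]))  # -> [0 1 32]
```

The resulting token sequence can then be embedded and fused with the other modality features, analogously to text tokens.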