Enhancing Speech Emotion Recognition with Graph-Based Multimodal Fusion and Prosodic Features for the Speech Emotion Recognition in Naturalistic Conditions Challenge at Interspeech 2025

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low robustness of speech emotion recognition (SER) in natural spontaneous speech—characterized by subtle emotional expressions and strong noise interference—this paper proposes a multimodal graph attention fusion framework. Methodologically, we first systematically validate the effectiveness of F0 quantization for SER in naturalistic scenarios; second, we design a graph attention network (GAT)-driven cross-modal fusion mechanism integrating acoustic, ASR-derived textual, and prosody-spectral features to enhance sparse emotion modeling; third, we incorporate a pretrained audio tagging model (PANNs) and multi-model ensemble to improve generalization. On the official Interspeech 2025 test set, our approach achieves a Macro F1 score of 39.79% (42.20% on the validation set). Ablation studies confirm that the GAT-based fusion yields significant performance gains over conventional fusion methods.
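The summary highlights F0 quantization but does not specify the binning scheme. As a minimal illustrative sketch (not the authors' implementation), one common choice is to map continuous F0 values onto a small discrete vocabulary using uniform bins on a log-Hz scale, with a dedicated token for unvoiced frames; the function name, bin count, and frequency range below are all assumptions:

```python
import math

def quantize_f0(f0_hz, n_bins=32, f_min=50.0, f_max=500.0):
    """Map a sequence of F0 values (Hz) to discrete tokens.

    Token 0 is reserved for unvoiced frames (f0 <= 0); voiced frames
    are binned uniformly on a log-Hz scale into tokens 1..n_bins.
    All parameters are illustrative defaults, not values from the paper.
    """
    lo, hi = math.log(f_min), math.log(f_max)
    width = (hi - lo) / n_bins
    tokens = []
    for f0 in f0_hz:
        if f0 <= 0:
            tokens.append(0)  # unvoiced token
        else:
            # clamp to [f_min, f_max], then compute a bin index in 1..n_bins
            x = math.log(min(max(f0, f_min), f_max))
            tokens.append(min(int((x - lo) / width) + 1, n_bins))
    return tokens
```

Discretizing F0 this way lets prosody enter the model as a token sequence, alongside the text tokens, rather than as a raw continuous contour.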

📝 Abstract
Training SER models in natural, spontaneous speech is especially challenging due to the subtle expression of emotions and the unpredictable nature of real-world audio. In this paper, we present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, focusing on categorical emotion recognition. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues. In particular, we investigate the effectiveness of Fundamental Frequency (F0) quantization and the use of a pretrained audio tagging model. We also employ an ensemble model to improve robustness. On the official test set, our system achieved a Macro F1-score of 39.79% (42.20% on validation). Our results underscore the potential of these methods, and analysis of fusion techniques confirmed the effectiveness of Graph Attention Networks. Our source code is publicly available.
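The abstract mentions an ensemble for robustness without detailing the combination rule. A minimal sketch, assuming simple soft voting (averaging each model's class-probability vector and taking the argmax); the function name and interface are hypothetical:

```python
def ensemble_predict(prob_lists):
    """Soft-voting ensemble: average per-class probabilities across models.

    prob_lists: one probability vector per model, all of equal length.
    Returns (predicted_class_index, averaged_probability_vector).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg
```

For example, with two models giving [0.6, 0.4] and [0.2, 0.8], the averaged vector is [0.4, 0.6] and class 1 is predicted.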
Problem

Research questions and friction points this paper is trying to address.

Improving speech emotion recognition in natural conditions
Combining audio and text features with prosodic cues
Evaluating graph-based fusion for robust emotion classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph Attention Networks for multimodal fusion
Fundamental Frequency quantization for prosodic features
Ensemble model to enhance robustness
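To make the fusion idea concrete: a GAT layer can treat each modality embedding (e.g. audio, text, prosody) as a node in a small fully connected graph and compute attention-weighted combinations across them. The sketch below is a generic single-head graph-attention layer in plain Python, not the paper's architecture; all names, dimensions, and the fully connected topology are assumptions:

```python
import math

def matvec(W, x):
    """Multiply matrix W (rows = output dims) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z

def gat_fuse(node_feats, W, a):
    """Single-head graph attention over a fully connected modality graph.

    node_feats: one feature vector per modality node.
    W: shared projection matrix (out_dim x in_dim).
    a: attention vector of length 2 * out_dim.
    Returns one updated vector per node.
    """
    h = [matvec(W, x) for x in node_feats]  # project each modality
    out = []
    for hi in h:
        # unnormalized scores e_ij = LeakyReLU(a . [h_i || h_j])
        scores = [leaky_relu(sum(ak * v for ak, v in zip(a, hi + hj)))
                  for hj in h]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alpha = [e / total for e in exps]
        # attention-weighted sum of the projected neighbors
        out.append([sum(al * hj[d] for al, hj in zip(alpha, h))
                    for d in range(len(hi))])
    return out
```

With three modality nodes, each output vector is a learned mixture of all three projected embeddings, which is the mechanism the ablation studies credit for the gains over conventional fusion.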
Alef Iury Siqueira Ferreira
Federal University of Goiás, Brazil
Machine Learning · Deep Learning · Speech Recognition · Bioacoustics · Natural Language Processing
L. Gris
Federal University of Goiás, Brazil
Alexandre Ferro Filho
Federal University of Goiás, Brazil
Lucas Ólives
Federal University of Goiás, Brazil
Daniel Ribeiro
Federal University of Goiás, Brazil
Luiz Fernando
Federal University of Goiás, Brazil
Fernanda Lustosa
Federal University of Rio Grande do Norte, Brazil
Rodrigo Tanaka
Aeronautics Institute of Technology, Brazil
Frederico Santos de Oliveira
Federal University of Mato Grosso, Brazil
Arlindo Galvão Filho
Federal University of Goiás, Brazil