🤖 AI Summary
To address the low robustness of speech emotion recognition (SER) in natural, spontaneous speech, where emotional expressions are subtle and noise interference is strong, this paper proposes a multimodal graph attention fusion framework. Methodologically, we first systematically validate the effectiveness of F0 quantization for SER in naturalistic scenarios; second, we design a graph attention network (GAT)-driven cross-modal fusion mechanism that integrates acoustic, ASR-derived textual, and prosodic-spectral features to strengthen the modeling of sparse emotional cues; third, we incorporate a pretrained audio tagging model (PANNs) and a multi-model ensemble to improve generalization. On the official INTERSPEECH 2025 test set, our approach achieves a Macro F1 score of 39.79% (42.20% on the validation set). Ablation studies confirm that GAT-based fusion yields significant performance gains over conventional fusion methods.
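As a rough illustration of the GAT-driven fusion idea (a minimal sketch, not the authors' architecture), the snippet below treats each modality embedding, acoustic, textual, and prosody-spectral, as a node in a fully connected graph and applies one graph attention layer in plain NumPy. All dimensions, weights, and the single-head setup are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gat_fuse(nodes, W, a, alpha=0.2):
    """One single-head GAT layer over a fully connected modality graph.

    nodes: (N, d_in) modality embeddings; W: (d_in, d_out) shared
    projection; a: (2*d_out,) attention vector; alpha: LeakyReLU slope.
    Returns attention-weighted node features of shape (N, d_out).
    """
    h = nodes @ W                       # project each node: (N, d_out)
    N = h.shape[0]
    e = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            # attention logit e_ij = LeakyReLU(a^T [h_i || h_j])
            z = np.concatenate([h[i], h[j]]) @ a
            e[i, j] = z if z > 0 else alpha * z
    # row-wise softmax over neighbors (numerically stabilized)
    attn = np.exp(e - e.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ h                     # fused per-node representations

# three modality nodes: acoustic, text, prosody-spectral (dims illustrative)
nodes = rng.standard_normal((3, 16))
W = rng.standard_normal((16, 8))
a = rng.standard_normal(16)
fused = gat_fuse(nodes, W, a)
```

In a full system, the fused node features would typically be pooled and passed to an emotion classifier; real implementations usually add multi-head attention and learned parameters.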
📝 Abstract
Training SER models on natural, spontaneous speech is especially challenging due to the subtlety of emotional expression and the unpredictable nature of real-world audio. In this paper, we present a robust system for the INTERSPEECH 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge, focusing on categorical emotion recognition. Our method combines state-of-the-art audio models with text features enriched by prosodic and spectral cues. In particular, we investigate the effectiveness of fundamental frequency (F0) quantization and the use of a pretrained audio tagging model. We also employ a model ensemble to improve robustness. On the official test set, our system achieves a Macro F1-score of 39.79% (42.20% on validation). These results underscore the potential of our methods, and our analysis of fusion techniques confirms the effectiveness of Graph Attention Networks. Our source code is publicly available.
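For context, F0 quantization typically maps continuous pitch values onto a small discrete vocabulary that a model can consume as tokens. A minimal sketch, assuming log-scale uniform bins and a reserved token for unvoiced frames (the bin count and frequency range here are illustrative, not the paper's settings):

```python
import numpy as np

def quantize_f0(f0_hz, n_bins=32, f_min=50.0, f_max=500.0):
    """Map continuous F0 values (Hz) to discrete token IDs.

    Voiced frames are clipped to [f_min, f_max] and binned uniformly in
    log-frequency into tokens 1..n_bins; unvoiced frames (f0 <= 0) get
    the reserved token 0.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    tokens = np.zeros_like(f0, dtype=int)   # default: unvoiced token 0
    voiced = f0 > 0
    log_f0 = np.log(np.clip(f0[voiced], f_min, f_max))
    edges = np.linspace(np.log(f_min), np.log(f_max), n_bins + 1)
    # digitize against interior edges -> bin index 0..n_bins-1, shift by 1
    tokens[voiced] = np.digitize(log_f0, edges[1:-1]) + 1
    return tokens

# e.g. an unvoiced frame, the lowest pitch, and the highest pitch
print(quantize_f0([0.0, 50.0, 500.0]))  # -> [0 1 32]
```

The resulting token sequence can then be embedded and fused with the other modality features, analogously to text tokens.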