🤖 AI Summary
This study addresses the challenge of jointly modeling fine-grained and global features in speech emotion recognition (SER). We present the first application of the High-Resolution Network (HRNet) to SER: spectrograms serve as input, and HRNet preserves high-resolution representations throughout the network via parallel multi-scale branches, enabling end-to-end learning of local acoustic detail and global semantic context simultaneously. The key contribution is overcoming the resolution degradation and loss of fine detail inherent in conventional SER architectures built on progressive downsampling, establishing a feature-representation paradigm that preserves both spatial fidelity and semantic richness. The method achieves state-of-the-art accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO, outperforming mainstream models and setting a new benchmark for SER.
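The multi-resolution exchange at the heart of HRNet can be illustrated with a toy sketch: a high-resolution branch is kept at full size while a parallel low-resolution branch is fused back into it at every exchange step, so fine detail is never lost to downsampling. This is a minimal numpy illustration of the idea, not the paper's implementation; the pooling/upsampling operators and the 64x128 "spectrogram" are assumptions for demonstration only.

```python
import numpy as np

def downsample(x):
    # 2x average pooling over non-overlapping 2x2 blocks
    h, w = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # 2x nearest-neighbour upsampling
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def exchange(high, low):
    # HRNet-style fusion: each branch receives the other's
    # information rescaled to its own resolution
    new_high = high + upsample(low)
    new_low = low + downsample(high)
    return new_high, new_low

spec = np.random.rand(64, 128)   # toy stand-in for a mel-spectrogram
high = spec                      # high-resolution branch (never downsampled)
low = downsample(spec)           # parallel low-resolution branch
high, low = exchange(high, low)
print(high.shape, low.shape)     # (64, 128) (32, 64)
```

Note that after the exchange the high-resolution branch still has its original shape: global context from the low-resolution branch is added in without ever shrinking the high-resolution feature map.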
📝 Abstract
Speech emotion recognition (SER) is pivotal for enhancing human-machine interaction. This paper introduces "EmoHRNet", a novel adaptation of the High-Resolution Network (HRNet) tailored for SER. Audio samples are transformed into spectrograms, from which EmoHRNet extracts high-level features while maintaining high-resolution representations from the initial to the final layers, capturing both granular and overarching emotional cues in the speech signal. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO, thereby setting a new benchmark in the SER domain.