EmoHRNet: High-Resolution Neural Network Based Speech Emotion Recognition

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of jointly modeling fine-grained and global features in speech emotion recognition (SER). We propose the first application of the High-Resolution Network (HRNet) to SER: spectrograms serve as input, and HRNet preserves high-resolution representations throughout the network via parallel multi-scale branches, enabling concurrent learning of local acoustic details and global semantic context under end-to-end training. The key contribution is overcoming the resolution degradation and loss of fine detail inherent in conventional SER architectures that rely on progressive downsampling, establishing a feature representation that preserves both spatial fidelity and semantic richness. The method achieves state-of-the-art accuracies of 92.45%, 80.06%, and 92.77% on RAVDESS, IEMOCAP, and EMOVO, respectively, significantly outperforming mainstream models and setting a new benchmark and research direction for SER.
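The core idea in the summary is HRNet's parallel multi-scale branches with repeated cross-resolution fusion, so the full-resolution path is never downsampled away. The following minimal PyTorch sketch illustrates one such stage with two branches; it is a rough illustration of the technique, not the authors' released code, and the channel widths, branch count, and the stem that lifts the spectrogram to 32 channels are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchHRStage(nn.Module):
    """One HRNet-style stage: a full-resolution branch and a half-resolution
    branch are convolved in parallel, then fused by exchanging information
    across scales, so high-resolution detail is preserved end to end."""

    def __init__(self, high_ch: int = 32, low_ch: int = 64):
        super().__init__()
        # Per-branch 3x3 convolutions keep each branch at its own resolution.
        self.high = nn.Sequential(
            nn.Conv2d(high_ch, high_ch, 3, padding=1),
            nn.BatchNorm2d(high_ch),
            nn.ReLU(inplace=True),
        )
        self.low = nn.Sequential(
            nn.Conv2d(low_ch, low_ch, 3, padding=1),
            nn.BatchNorm2d(low_ch),
            nn.ReLU(inplace=True),
        )
        # Cross-scale fusion: a strided conv downsamples high -> low;
        # a 1x1 conv plus bilinear upsampling lifts low -> high.
        self.high_to_low = nn.Conv2d(high_ch, low_ch, 3, stride=2, padding=1)
        self.low_to_high = nn.Conv2d(low_ch, high_ch, 1)

    def forward(self, x_high, x_low):
        h = self.high(x_high)
        l = self.low(x_low)
        # Each branch absorbs the other scale's features (multi-scale fusion).
        h_fused = h + F.interpolate(
            self.low_to_high(l), size=h.shape[-2:],
            mode="bilinear", align_corners=False,
        )
        l_fused = l + self.high_to_low(h)
        return F.relu(h_fused), F.relu(l_fused)

if __name__ == "__main__":
    # A 128-bin spectrogram treated as an image, already lifted to 32
    # channels by a hypothetical stem convolution (not shown here).
    x_high = torch.randn(2, 32, 128, 256)  # batch, channels, mel bins, frames
    x_low = torch.randn(2, 64, 64, 128)    # half resolution, wider channels
    h, l = TwoBranchHRStage()(x_high, x_low)
    print(h.shape, l.shape)  # the high-res output keeps its 128x256 grid
```

Stacking several such stages, and adding further lower-resolution branches at later stages, yields the HRNet pattern the summary describes: local acoustic detail survives in the high-resolution branch while the low-resolution branches supply global context.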

📝 Abstract
Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the initial to the final layers. By transforming audio samples into spectrograms, EmoHRNet leverages the HRNet architecture to extract high-level features. EmoHRNet's unique architecture maintains high-resolution representations throughout, capturing both granular and overarching emotional cues from speech signals. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO. Thus, we show that EmoHRNet sets a new benchmark in the SER domain.
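For concreteness, here is a minimal sketch of the spectrogram front end the abstract describes, using librosa. The abstract only states that audio samples are transformed into spectrograms, so the sample rate, FFT size, hop length, mel-bin count, and normalization below are assumptions, not the paper's reported settings.

```python
import librosa
import numpy as np

def audio_to_logmel(path: str, sr: int = 16000, n_mels: int = 128) -> np.ndarray:
    """Load one utterance and convert it to a normalized log-mel spectrogram,
    the 2-D image-like representation the network consumes."""
    y, sr = librosa.load(path, sr=sr)  # resample to a fixed rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )
    log_mel = librosa.power_to_db(mel, ref=np.max)  # compress dynamic range
    # Per-utterance standardization so the network sees zero-mean inputs.
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)

# Example: spec = audio_to_logmel("utterance.wav")  # shape (n_mels, frames)
```

The resulting array (mel bins by frames) is then treated as a single-channel image and passed to the HRNet-style backbone sketched above.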
Problem

Research questions and friction points this paper is trying to address.

Enhancing human-machine interaction through speech emotion recognition
Maintaining high-resolution representations for emotional cue extraction
Outperforming existing models on multiple benchmark emotion datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts HRNet for speech emotion recognition
Maintains high-resolution representations throughout processing
Transforms audio to spectrograms for feature extraction
Akshay Muppidi
Stony Brook University, Department of Computer Science, Stony Brook, New York, USA
Martin Radfar
Unknown affiliation