🤖 AI Summary
This study addresses the limited generalization of speech emotion recognition (SER) models across multiple datasets and seven emotion classes (neutral, happy, sad, angry, fearful, disgusted, surprised). We propose a hybrid DCRF-BiLSTM model, the first to integrate Deep Convolutional Recurrent Networks (DCRF) with Bidirectional Long Short-Term Memory (BiLSTM) networks, enhanced by optimized acoustic feature engineering to improve time-frequency representation learning. Our key contribution is a unified evaluation on five major benchmark datasets (RAVDESS, TESS, SAVEE, EmoDB, and CREMA-D), enabling rigorous assessment of robustness across languages and recording conditions. Experimental results demonstrate state-of-the-art performance: per-dataset accuracy ranges from 97.83% to 100%; accuracy on the combined RAVDESS+TESS+SAVEE setting reaches 98.82%; and joint evaluation across all five datasets yields an overall accuracy of 93.76%, surpassing previously reported results.
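The summary does not spell out the DCRF-BiLSTM architecture in detail, so the sketch below is only an illustration of the general pattern it names: a convolutional stack over a time-frequency input (here assumed to be MFCCs) feeding a bidirectional LSTM. The class name `ConvBiLSTMSER`, all layer sizes, and the 40-coefficient input are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the paper's exact DCRF-BiLSTM configuration is
# not given in this summary, so every layer size below is an assumption.
import torch
import torch.nn as nn

class ConvBiLSTMSER(nn.Module):
    """Hypothetical hybrid: 2-D convolutions over a time-frequency input
    (e.g., MFCCs), followed by a bidirectional LSTM over the time axis."""
    def __init__(self, n_mfcc: int = 40, n_classes: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),              # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        feat_dim = 64 * (n_mfcc // 4)          # channels x pooled frequency bins
        self.bilstm = nn.LSTM(feat_dim, 128, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * 128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mfcc, time)
        h = self.conv(x)                       # (batch, 64, n_mfcc//4, time)
        h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, time, feat_dim)
        out, _ = self.bilstm(h)                # (batch, time, 256)
        return self.head(out.mean(dim=1))      # average over time, classify

model = ConvBiLSTMSER()
logits = model(torch.randn(8, 1, 40, 300))     # 8 clips, 40 MFCCs, 300 frames
print(logits.shape)                            # torch.Size([8, 7])
```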
📝 Abstract
Speech emotion recognition (SER) plays a vital role in human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model recognizes seven emotions (neutral, happy, sad, angry, fearful, disgusted, and surprised) and is trained and evaluated on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and CREMA-D (C). The model achieves high accuracy on the individual datasets: 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% on CREMA-D, and 100% on both TESS and EmoDB. On the combined R+T+S datasets it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. We introduce this comprehensive combination and achieve an overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.
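As a rough illustration of what pooling the five corpora involves, the sketch below maps each corpus's native emotion codes onto the shared seven-class label set. Nothing here is taken from the paper itself: the mappings shown for CREMA-D and RAVDESS follow those corpora's widely documented filename conventions, CREMA-D has no surprised class, and RAVDESS's extra calm class is dropped by assumption.

```python
# Sketch of pooling corpora under one seven-class label set; verify the
# native codes against the actual files before relying on them.
CLASSES = ["neutral", "happy", "sad", "angry",
           "fearful", "disgusted", "surprised"]

# CREMA-D filename codes (the corpus has no "surprised" recordings).
CREMA_D = {"NEU": "neutral", "HAP": "happy", "SAD": "sad",
           "ANG": "angry", "FEA": "fearful", "DIS": "disgusted"}

# RAVDESS numeric emotion codes; "02" (calm) is omitted by assumption.
RAVDESS = {"01": "neutral", "03": "happy", "04": "sad", "05": "angry",
           "06": "fearful", "07": "disgusted", "08": "surprised"}

def to_class_index(native_label: str, corpus_map: dict) -> int | None:
    """Map a corpus-specific label to the shared class index, or None
    if the corpus has no counterpart for that emotion."""
    name = corpus_map.get(native_label)
    return CLASSES.index(name) if name is not None else None

print(to_class_index("FEA", CREMA_D))  # 4
print(to_class_index("02", RAVDESS))   # None (calm not in the 7-class set)
```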