A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited cross-dataset generalization of speech emotion recognition (SER) over seven emotion classes (neutral, happy, sad, angry, fearful, disgusted, surprised). The authors propose a hybrid DCRF-BiLSTM model that integrates a DCRF encoder with Bidirectional Long Short-Term Memory (BiLSTM) networks, enhanced by optimized acoustic feature engineering to improve time-frequency representation learning. The key contribution is a unified evaluation on five major benchmark datasets (RAVDESS, TESS, SAVEE, EmoDB, and CREMA-D), enabling rigorous assessment of robustness across languages and recording conditions. Experimental results show state-of-the-art performance: per-dataset accuracy ranges from 97.83% to 100%; accuracy on the combined RAVDESS+TESS+SAVEE set is 98.82%; and joint evaluation across all five datasets achieves an overall accuracy of 93.76%, surpassing existing methods.
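The summary mentions "optimized acoustic feature engineering" but this page gives no details of the feature set. As a purely hypothetical illustration of frame-level acoustic feature extraction of the kind SER pipelines build on (log-energy and zero-crossing rate are stand-ins here; the paper's actual features, e.g. MFCCs or mel-spectrograms, may differ):

```python
import math

def frame_features(signal, frame_len=400, hop=160):
    """Split a waveform into overlapping frames and compute two simple
    acoustic features per frame: log-energy and zero-crossing rate.
    Illustrative only; not the paper's actual feature set."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame)
        log_energy = math.log(energy + 1e-10)  # small floor avoids log(0)
        # Zero-crossing rate: fraction of adjacent samples that change sign.
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        ) / (frame_len - 1)
        feats.append((log_energy, zcr))
    return feats

# Example: a synthetic 1 kHz tone sampled at 16 kHz (0.1 s of audio).
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(1600)]
features = frame_features(tone)
print(len(features), features[0])  # 8 frames of (log_energy, zcr) pairs
```

With a 400-sample window and 160-sample hop, the 1600-sample tone yields 8 frames; the resulting per-frame vectors would then feed a sequence model such as a BiLSTM.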

📝 Abstract
Nowadays, speech emotion recognition (SER) plays a vital role in human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model recognizes seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise, and is trained on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and CREMA-D (C). The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% on CREMA-D, and a perfect 100% on both TESS and EmoDB. For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. In our work, we introduce this comprehensive combination and achieve an overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.
Problem

Research questions and friction points this paper is trying to address.

Develop hybrid deep learning for speech emotion detection
Recognize seven emotions across multiple benchmark datasets
Improve accuracy and generalizability in emotion recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid DCRF-BiLSTM model for emotion detection
Feature engineering enhances speech emotion recognition
Evaluated across five benchmark datasets simultaneously
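Joint evaluation over five corpora implies mapping each dataset's own emotion codes onto the shared seven-class label space. A minimal sketch of such label harmonization, under the assumption that it follows the corpora's published naming conventions (the code tables below are illustrative; TESS and SAVEE are omitted for brevity, and entries should be verified against each corpus's documentation):

```python
# Hypothetical label harmonization for joint training across SER corpora.
SEVEN_CLASSES = {"neutral", "happy", "sad", "angry",
                 "fear", "disgust", "surprise"}

# Per-dataset emotion codes (e.g., RAVDESS uses a two-digit field in its
# filenames; EmoDB uses German letter codes) mapped to the shared classes.
LABEL_MAPS = {
    "RAVDESS": {"01": "neutral", "03": "happy", "04": "sad", "05": "angry",
                "06": "fear", "07": "disgust", "08": "surprise"},
    "EmoDB":   {"N": "neutral", "F": "happy", "T": "sad", "W": "angry",
                "A": "fear", "E": "disgust"},  # EmoDB has no "surprise"
    "CREMA-D": {"NEU": "neutral", "HAP": "happy", "SAD": "sad",
                "ANG": "angry", "FEA": "fear", "DIS": "disgust"},
}

def unify_label(dataset, raw_code):
    """Map a dataset-specific emotion code onto the shared seven-class space.
    Returns None for codes outside it (e.g., RAVDESS '02' = calm)."""
    label = LABEL_MAPS.get(dataset, {}).get(raw_code)
    return label if label in SEVEN_CLASSES else None

print(unify_label("RAVDESS", "05"))  # angry
print(unify_label("EmoDB", "W"))     # angry
print(unify_label("RAVDESS", "02"))  # None: calm is not a shared class
```

Filtering out unmapped codes keeps the combined R+T+S+C+E training set consistent with the seven target classes.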
Shahana Yasmin Chowdhury
University of New Orleans, USA
Bithi Banik
Kristiania University, Norway
Md Tamjidul Hoque
Professor of Computer Science, University of New Orleans
Bioinformatics · Machine Learning · Artificial Intelligence
Shreya Banerjee
University of New Orleans, USA