🤖 AI Summary
This study addresses the limited generalization of speech emotion recognition (SER) models across multiple datasets and seven emotion classes (neutral, happy, sad, angry, fearful, disgusted, surprised). We propose a hybrid DCRF-BiLSTM model, the first to integrate Deep Convolutional Recurrent Networks (DCRF) with Bidirectional Long Short-Term Memory (BiLSTM) networks, enhanced by optimized acoustic feature engineering to improve time-frequency representation learning. Our key contribution is a unified evaluation on five major benchmark datasets (RAVDESS, TESS, SAVEE, EmoDB, and CREMA-D), enabling rigorous assessment of robustness across languages and recording conditions. Experimental results demonstrate state-of-the-art performance: per-dataset accuracy ranges from 97.83% to 100%; accuracy on the combined RAVDESS+TESS+SAVEE setting reaches 98.82%; and joint evaluation across all five datasets yields an overall accuracy of 93.76%, surpassing previously reported results.
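The summary does not spell out the DCRF-BiLSTM architecture in detail, so the sketch below is only an illustration of the general pattern it names: a convolutional stack over a time-frequency input (here assumed to be MFCCs) feeding a bidirectional LSTM. The class name `ConvBiLSTMSER`, all layer sizes, and the 40-coefficient input are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the paper's exact DCRF-BiLSTM configuration is
# not given in this summary, so every layer size below is an assumption.
import torch
import torch.nn as nn

class ConvBiLSTMSER(nn.Module):
    """Hypothetical hybrid: 2-D convolutions over a time-frequency input
    (e.g., MFCCs), followed by a bidirectional LSTM over the time axis."""
    def __init__(self, n_mfcc: int = 40, n_classes: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),              # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        feat_dim = 64 * (n_mfcc // 4)          # channels x pooled frequency bins
        self.bilstm = nn.LSTM(feat_dim, 128, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * 128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mfcc, time)
        h = self.conv(x)                       # (batch, 64, n_mfcc//4, time)
        h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, time, feat_dim)
        out, _ = self.bilstm(h)                # (batch, time, 256)
        return self.head(out.mean(dim=1))      # average over time, classify

model = ConvBiLSTMSER()
logits = model(torch.randn(8, 1, 40, 300))     # 8 clips, 40 MFCCs, 300 frames
print(logits.shape)                            # torch.Size([8, 7])
```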
📝 Abstract
Speech emotion recognition (SER) plays a vital role in human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model recognizes seven emotions (neutral, happy, sad, angry, fearful, disgusted, and surprised) and is trained and evaluated on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and CREMA-D (C). The model achieves high accuracy on the individual datasets: 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% on CREMA-D, and 100% on both TESS and EmoDB. On the combined R+T+S datasets it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. We introduce this comprehensive combination and achieve an overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.
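As a rough illustration of what pooling the five corpora involves, the sketch below maps each corpus's native emotion codes onto the shared seven-class label set. Nothing here is taken from the paper itself: the mappings shown for CREMA-D and RAVDESS follow those corpora's widely documented filename conventions, CREMA-D has no surprised class, and RAVDESS's extra calm class is dropped by assumption.

```python
# Sketch of pooling corpora under one seven-class label set; verify the
# native codes against the actual files before relying on them.
CLASSES = ["neutral", "happy", "sad", "angry",
           "fearful", "disgusted", "surprised"]

# CREMA-D filename codes (the corpus has no "surprised" recordings).
CREMA_D = {"NEU": "neutral", "HAP": "happy", "SAD": "sad",
           "ANG": "angry", "FEA": "fearful", "DIS": "disgusted"}

# RAVDESS numeric emotion codes; "02" (calm) is omitted by assumption.
RAVDESS = {"01": "neutral", "03": "happy", "04": "sad", "05": "angry",
           "06": "fearful", "07": "disgusted", "08": "surprised"}

def to_class_index(native_label: str, corpus_map: dict) -> int | None:
    """Map a corpus-specific label to the shared class index, or None
    if the corpus has no counterpart for that emotion."""
    name = corpus_map.get(native_label)
    return CLASSES.index(name) if name is not None else None

print(to_class_index("FEA", CREMA_D))  # 4
print(to_class_index("02", RAVDESS))   # None (calm not in the 7-class set)
```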