🤖 AI Summary
Arabic speech emotion recognition (SER) has long suffered from data scarcity and the high computational overhead of existing models. To address these challenges, this paper proposes a lightweight hybrid architecture that takes Mel-spectrogram images as input and integrates 2D convolutional neural networks (CNNs), bidirectional long short-term memory (BiLSTM) networks, and self-attention mechanisms. It deliberately omits conventional MFCC features and 1D convolutions in order to model fine-grained spectro-temporal emotional features. The resulting model contains only ~1 million parameters, approximately 1/90 the size of HuBERT-base and 1/74 that of Whisper, which enables efficient deployment on resource-constrained devices. Evaluated on major Arabic SER benchmarks, it achieves state-of-the-art (SOTA) accuracy. This work thus establishes an accurate yet computationally efficient framework for SER in low-resource languages, offering a practical and scalable solution for real-world deployment.
📝 Abstract
Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods.
While recent models favor large-scale architectures with tens of millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters, roughly 1/90 the size of HuBERT base and 1/74 the size of Whisper. This efficiency makes it well suited to resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering strong performance and accessibility for real-world applications.
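To make the roughly 1 million parameter figure more concrete, the sketch below counts the trainable weights of a generic 2D-CNN + BiLSTM + self-attention stack applied to a Mel spectrogram. Every size in it (64 Mel bins, three 3x3 convolution layers with 32/64/128 channels and 2x2 pooling, a BiLSTM with 64 hidden units per direction, 6 emotion classes) is a hypothetical value chosen for illustration, not the configuration reported for ArabEmoNet. Function names such as `parameter_budget` are likewise made up here. Normalization layers are ignored, and the LSTM count assumes two bias vectors per gate, as in PyTorch's `nn.LSTM`.

```python
"""Rough parameter budget for a 2D-CNN + BiLSTM + self-attention classifier.

All layer sizes are hypothetical and for illustration only; they are not
the ArabEmoNet configuration.
"""


def conv2d_params(in_ch, out_ch, kernel=3):
    # kernel x kernel weights per (input, output) channel pair, plus one bias per output channel
    return out_ch * in_ch * kernel * kernel + out_ch


def bilstm_params(input_size, hidden_size):
    # Per direction: 4 gates, each with input weights, recurrent weights,
    # and two bias vectors (the PyTorch nn.LSTM convention).
    one_direction = 4 * (hidden_size * input_size
                         + hidden_size * hidden_size
                         + 2 * hidden_size)
    return 2 * one_direction


def self_attention_params(embed_dim):
    # Query, key, value, and output projections, each with a bias
    return 4 * (embed_dim * embed_dim + embed_dim)


def linear_params(in_features, out_features):
    return in_features * out_features + out_features


def parameter_budget(n_mels=64, conv_channels=(32, 64, 128),
                     lstm_hidden=64, n_classes=6):
    """Return per-block and total parameter counts (assumed layer sizes)."""
    in_ch, conv_total = 1, 0  # a spectrogram is treated as a 1-channel image
    for out_ch in conv_channels:
        conv_total += conv2d_params(in_ch, out_ch)
        in_ch = out_ch

    # Assumption: each conv layer is followed by 2x2 pooling, halving the
    # frequency axis; channels and remaining frequency bins are flattened
    # into one feature vector per time step for the BiLSTM.
    freq_bins = n_mels // (2 ** len(conv_channels))
    lstm_input = conv_channels[-1] * freq_bins

    budget = {
        "conv": conv_total,
        "bilstm": bilstm_params(lstm_input, lstm_hidden),
        "attention": self_attention_params(2 * lstm_hidden),
        "classifier": linear_params(2 * lstm_hidden, n_classes),
    }
    budget["total"] = sum(budget.values())
    return budget


if __name__ == "__main__":
    for block, count in parameter_budget().items():
        print(f"{block:>10}: {count:,}")
```

With these assumed sizes the total comes to 717,574 parameters, most of them in the BiLSTM input weights. This suggests that a stack of this kind can stay below one million parameters as long as the feature vector handed to the recurrent layer is kept small.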