ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition

📅 2025-09-01

🤖 AI Summary

Arabic speech emotion recognition (SER) has long suffered from data scarcity and from the high computational overhead of existing models. To address these challenges, this paper proposes a lightweight hybrid architecture that takes Mel-spectrogram images as input and integrates 2D convolutional neural networks (CNNs), bidirectional long short-term memory (BiLSTM) networks, and self-attention mechanisms. It deliberately omits conventional MFCC features and 1D convolutions in order to model fine-grained spectro-temporal emotional features. The resulting model contains only about 1 million parameters, approximately 1/90 the size of HuBERT-base and 1/74 that of Whisper, which enables efficient deployment on resource-constrained devices. Evaluated on major Arabic SER benchmarks, it achieves state-of-the-art (SOTA) accuracy. This work thus establishes an accurate yet computationally efficient framework for SER in low-resource languages, offering a practical solution for real-world deployment.

📝 Abstract

Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters, 90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.
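
To make the size comparison concrete, the following sketch works out the parameter counts implied by the quoted ratios and the resulting weight storage. It assumes the "1 million" figure is a rounded value and that weights are stored as 32-bit floats; neither the exact count nor the storage format is stated in the abstract, so the outputs are rough estimates only.

```python
# Parameter counts implied by the ratios quoted in the abstract.
# The 1M figure is rounded in the source, so all results are approximate.
arabemonet_params = 1_000_000

# "90 times smaller than HuBERT base and 74 times smaller than Whisper"
ratios = {"HuBERT base": 90, "Whisper": 74}
implied_params = {name: arabemonet_params * r for name, r in ratios.items()}

BYTES_PER_FLOAT32 = 4  # assumption: weights stored as 32-bit floats


def size_mb(n_params: int) -> float:
    """Approximate weight storage in megabytes (10^6 bytes)."""
    return n_params * BYTES_PER_FLOAT32 / 1e6


print(f"ArabEmoNet: ~{size_mb(arabemonet_params):.0f} MB of weights")
for name, n in implied_params.items():
    print(f"{name}: ~{n / 1e6:.0f}M parameters, ~{size_mb(n):.0f} MB of weights")
```

Under these assumptions the gap is a few megabytes of weights versus a few hundred, which is the practical meaning of the claim about resource-constrained environments.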

Problem

Research questions and friction points this paper is trying to address.

Addressing Arabic speech emotion recognition with limited data and resources

Overcoming the loss of nuanced emotional cues in traditional feature extraction

Providing an efficient model for resource-constrained environments, in contrast to large architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid 2D CNN-BiLSTM with attention mechanism

Uses Mel spectrograms instead of MFCC features

Lightweight model with only 1 million parameters
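
The three points above can be sketched as a small PyTorch module. This is a minimal illustration of the general 2D CNN, BiLSTM, and attention pattern, not the paper's implementation: the class name `HybridSERSketch`, every layer size, the number of Mel bands, and the number of emotion classes are assumptions chosen so that the total lands near the one-million-parameter range. The attention here is a simple learned weighting over time steps, swapped in for brevity; the paper's attention mechanism may differ.

```python
import torch
import torch.nn as nn


class HybridSERSketch(nn.Module):
    """Illustrative 2D CNN -> BiLSTM -> attention-pooling classifier.

    All sizes are hypothetical and not taken from the paper.
    """

    def __init__(self, n_mels: int = 64, n_classes: int = 4, hidden: int = 96):
        super().__init__()
        # 2D convolutions treat the Mel spectrogram as a one-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),  # halves both the frequency and time axes
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 64 * (n_mels // 4)  # channels * remaining frequency bins
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn_score = nn.Linear(2 * hidden, 1)  # one score per time step
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, frames)
        x = self.cnn(mel)  # (batch, 64, n_mels / 4, frames / 4)
        b, c, f, t = x.shape
        # Make time the sequence axis and flatten channels * frequency.
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.bilstm(x)  # (batch, t, 2 * hidden)
        weights = torch.softmax(self.attn_score(h), dim=1)  # (batch, t, 1)
        pooled = (weights * h).sum(dim=1)  # attention-weighted average over time
        return self.classifier(pooled)  # (batch, n_classes) logits


model = HybridSERSketch()
model.eval()
dummy = torch.randn(2, 1, 64, 100)  # two fake spectrograms, 64 bands x 100 frames
with torch.no_grad():
    logits = model(dummy)
n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)
```

In this sketch most parameters sit in the BiLSTM input weights, so the pooling in the convolutional front end and the LSTM hidden size are the main levers for keeping such a model small.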
