🤖 AI Summary
This work addresses Arabic Dialect Identification (ADI), a challenging speech classification task. We introduce ADI-20, an extension of the ADI-17 dataset that covers the dialects of all Arabic-speaking countries: 3,556 hours of speech from 19 Arabic dialects plus Modern Standard Arabic (MSA). We train and evaluate two families of ADI systems on it: fine-tuned pre-trained ECAPA-TDNN models, and Whisper encoder blocks followed by an attention pooling layer and a dense classification layer. We also analyze how training data size and model parameter count affect identification performance; the models prove data-efficient, with F1 dropping by <1.5% when trained on only 30% of the data. The collected data and trained models are released openly to support reproduction and further ADI research.
📝 Abstract
We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers the dialects of all Arabic-speaking countries. It comprises 3,556 hours of speech from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a dense classification layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show only a small decrease in F1 score when using just 30% of the original training data. We open-source our collected data and trained models to enable reproduction of our work and to support further research in ADI.
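To make the second architecture more concrete, the following is a minimal sketch of an attention pooling layer followed by a dense classification layer, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the frame embeddings are random stand-ins for Whisper encoder outputs, the single-vector attention scoring is one common pooling variant that the abstract does not confirm, and the embedding size, weights, and function names are all hypothetical. Only the number of classes (19 dialects plus MSA) comes from the abstract.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w_att):
    """Collapse T frame embeddings (each of size D) into one vector of size D.

    Each frame gets a scalar score (dot product with w_att); the scores are
    normalized with softmax and used as weights in a weighted average.
    """
    scores = [sum(w * x for w, x in zip(w_att, frame)) for frame in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    return [sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)]


def classify(pooled, weights, biases):
    """Dense layer plus softmax: one probability per class."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


NUM_CLASSES = 20  # 19 Arabic dialects + MSA, as stated in the abstract
DIM = 8           # hypothetical embedding size, far smaller than a real encoder's
NUM_FRAMES = 50   # hypothetical number of encoder output frames

rng = random.Random(0)
# Random stand-ins for encoder outputs and for learned parameters (assumptions).
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
w_att = [rng.gauss(0, 1) for _ in range(DIM)]
weights = [[rng.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

pooled = attention_pool(frames, w_att)
probs = classify(pooled, weights, biases)
predicted = max(range(NUM_CLASSES), key=probs.__getitem__)
```

The pooling step lets the classifier handle utterances of any length, since the dense layer always receives one fixed-size vector regardless of how many frames the encoder produced.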