ADI-20: Arabic Dialect Identification dataset and models

📅 2025-08-17

🏛️ Interspeech

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This work addresses Arabic Dialect Identification (ADI), a challenging speech classification task. It introduces ADI-20, an extension of the earlier ADI-17 dataset that covers the dialects of all Arabic-speaking countries: 3,556 hours of speech from 19 Arabic dialects plus Modern Standard Arabic (MSA). Two families of systems are trained and evaluated on the dataset: fine-tuned pre-trained ECAPA-TDNN-based models, and Whisper encoder blocks followed by an attention pooling layer and a dense classification layer. The study also analyzes how training data size and model parameter count affect identification performance; F1 reportedly drops by less than 1.5% when training on only 30% of the data. The collected data and trained models are open-sourced to support reproduction and further research in ADI.

📝 Abstract

We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.
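
The abstract's second system, encoder frames passed through an attention pooling layer and then a dense classification layer, can be illustrated with a minimal sketch. This is not the authors' implementation: the single-vector scoring function, the toy 2-dimensional frames, and the names `attention_pool` and `classify` are assumptions made for illustration, and a real system would operate on Whisper encoder outputs with learned weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, score_weights):
    """Collapse a T x D list of frame vectors into one D-dimensional vector.

    Assumption: each frame is scored by a dot product with a single weight
    vector; the softmax of those scores weights the sum over frames.
    """
    scores = [sum(w * x for w, x in zip(score_weights, frame)) for frame in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    return [sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)]


def classify(pooled, weight_rows, biases):
    """Dense layer followed by argmax; returns (class_index, logits)."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight_rows, biases)
    ]
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, logits


# Toy input: three frames of dimension 2 (stand-ins for encoder outputs).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Zero scoring weights give uniform attention, so pooling reduces to a mean.
pooled = attention_pool(frames, [0.0, 0.0])
# Hypothetical two-class dense layer.
label, logits = classify(pooled, [[1.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
```

With non-zero scoring weights the pooled vector shifts toward the frames that score highest, which is what lets the classifier focus on the most dialect-informative parts of an utterance.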

Problem

Research questions and friction points this paper is trying to address.

Extending the Arabic dialect identification dataset to cover all Arabic-speaking countries

Evaluating the performance of state-of-the-art dialect identification systems

Investigating the impact of training data size and model parameter count on identification performance
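
One way to set up the training-data-size experiment is to draw a reduced training set that keeps each dialect's share intact. The sketch below assumes per-utterance sampling stratified by dialect; the source does not say how the 30% subset was drawn (it may have been measured in hours rather than utterances), and the utterance IDs and the `EGY`/`MSA` labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(utterances, fraction, seed=0):
    """Keep roughly `fraction` of the utterances for every dialect.

    utterances: list of (utterance_id, dialect) pairs.
    Assumption: sampling is per utterance, not per hour of audio.
    """
    by_dialect = defaultdict(list)
    for utt_id, dialect in utterances:
        by_dialect[dialect].append(utt_id)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for dialect in sorted(by_dialect):
        ids = by_dialect[dialect]
        keep = max(1, round(len(ids) * fraction))  # never drop a dialect entirely
        for utt_id in rng.sample(ids, keep):
            subset.append((utt_id, dialect))
    return subset


# Hypothetical toy corpus: 100 utterances of one dialect, 50 of another.
train = [(f"egy_{i}", "EGY") for i in range(100)]
train += [(f"msa_{i}", "MSA") for i in range(50)]
small = stratified_subset(train, 0.30)
```

Stratifying by dialect keeps the class balance of the reduced set comparable to the full set, so a change in F1 can be attributed to the amount of data rather than to a shifted label distribution.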

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned pre-trained ECAPA-TDNN models

Used Whisper encoder blocks with an attention pooling layer and a dense classification layer

Evaluated the impact of training data size on identification performance
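
The abstract reports results as an F1 score without saying how it is averaged across dialects. As a hedged illustration, the sketch below computes a macro-averaged F1 (the unweighted mean of per-class F1), which is an assumption here and not necessarily the paper's metric; the labels and predictions are made up.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 scores.

    Assumption: macro averaging; the source does not state which F1 variant
    was used.
    """
    labels = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Made-up labels: one EGY utterance is misclassified as MSA.
truth = ["EGY", "EGY", "MSA", "MSA"]
preds = ["EGY", "MSA", "MSA", "MSA"]
score = macro_f1(truth, preds)
```

Macro averaging weights every dialect equally, so a low-resource dialect that is often misclassified pulls the score down as much as a high-resource one would.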
