Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

📅 2026-01-19
🤖 AI Summary
This study addresses the persistent difficulty machine translation systems have with the many dialectal varieties of Arabic, which limits their usefulness for millions of native speakers. To bridge this gap, the authors introduce Alexandria, a large-scale, community-driven dataset of 107,000 human-translated samples drawn from multi-turn conversational scenarios across 13 Arab countries and 11 high-impact domains. Alexandria adds city-level dialect metadata and explicit speaker-addressee gender configurations, going beyond conventional coarse-grained regional labels. The authors evaluate Arabic-aware large language models on the dataset using both automatic metrics and human assessment. The results expose significant, persistent shortcomings in dialectal Arabic translation and position Alexandria as both a training resource and a benchmark for future work on this linguistically complex setting.

📝 Abstract
Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant persistent challenges.
Problem

Research questions and friction points this paper is trying to address.

dialectal Arabic
machine translation
linguistic diversity
diglossia
Arabic NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

dialectal Arabic
fine-grained geographic metadata
gender-conditioned variation
multi-domain MT dataset
community-driven translation
Abdellah El Mekki
The University of British Columbia
S. Magdy
The University of British Columbia
Houdaifa Atou
Mohammed VI Polytechnic University
Ruwa AbuHweidi
Birzeit University
Baraah Qawasmeh
Western Michigan University
Omer Nacar
Tuwaiq Academy
Thikra Al-hibiri
King Khalid University
Razan Saadie
Jordan University of Science and Technology
Hamzah A. Alsayadi
Ibb University
Nadia Ghezaiel Hammouda
University of Hail
Alshima Alkhazimi
University of Technology and Applied Sciences
Aya Hamod
Arab Open University
Al-Yas Al-Ghafri
University of Technology and Applied Sciences
Wesam El-Sayed
Minia University
Asila Al sharji
University of Technology and Applied Sciences
Mohamad Ballout
PhD in Cognitive Science, University of Osnabrück
Computer Vision, Deep Learning, Cognitive Science
Anas Belfathi
Nantes University
Karim Ghaddar
American University of Beirut
Serry Sibaee
Research Engineer
Arabic Natural Language Processing, NLP
Alaa Aoun
Jordan University of Science and Technology
A. Asiri
King Khalid University
Lina Abureesh
Birzeit University
Ahlam Bashiti
Birzeit University
Majdal Yousef
Birzeit University
Abdulaziz Hafiz
Umm Al-Qura University
Yehdih Mohamed
University of Nouakchott
Emira Hamedtou
University of Nouakchott
Brakehe Brahim
University of Nouakchott
Rahaf Alhamouri
Jordan University of Science and Technology
Youssef Nafea
Masters Student at MBZUAI
Deep Learning, LLMs, Natural Language Processing, Speech Processing
Aya El Aatar
Mohammed VI Polytechnic University
Walid Al-Dhabyani
Hadhramout University; Cairo University
Emhemed Hamed
Misurata University
Sara Shatnawi
Al-Balqa Applied University
Fakhraddin Alwajih
Postdoctoral Fellow Researcher @ UBC
Artificial Intelligence, Machine Learning, Natural Language Processing
Khalid Elkhidir
University of Khartoum
A. Alasmari
King Khalid University
Abdurrahman Gerrio
Misurata University
Omar Said Alshahri
Sultan Qaboos Higher Centre for Culture and Science
AbdelRahim Elmadany
The University of British Columbia
Ismail Berrada
Mohammed VI Polytechnic University
Amir Azad Adli Alkathiri
University of Technology and Applied Sciences
Fadi A. Zaraket
American University of Beirut
Mustafa Jarrar
Professor, Hamad Bin Khalifa University, Qatar - Birzeit University, Palestine
Arabic Natural Language Processing, Social Computing, Ontology Engineering, Knowledge Graphs
Yahya Ould Mohamed El Hadj
Arab Center for Research and Policy Studies
Hassan Alhuzali
Assistant Professor @ Umm Al-Qura University
Natural Language Processing, Cultural Awareness of LLMs, Mental Health, Affective Computing
M. Abdul-Mageed
The University of British Columbia, Canada Research Chair in NLP and ML