🤖 AI Summary
This study systematically evaluates large language models (LLMs) for Arabic—including Classical Arabic, Modern Standard Arabic, and regional dialects—identifying critical gaps in dialectal coverage, data diversity, and open-source transparency. Through a comprehensive literature review, it is the first to comparatively analyze encoder-only, decoder-only, and encoder-decoder architectures; characterize multi-source pretraining data composition; assess downstream task performance; and quantitatively measure openness across model weights, source code, and documentation. Key findings reveal that dialect modeling remains severely constrained by scarce annotated data and training biases, while over half of the surveyed models lack publicly released weights or complete technical documentation. In response, the paper proposes an inclusive modeling paradigm centered on “data-diversity-driven development and full-stack open transparency,” accompanied by an actionable research roadmap for advancing Arabic LLMs.
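To make the openness assessment concrete, below is a minimal sketch of how a per-model openness score could be computed across weights, source code, training data, and documentation. It is not the paper's actual rubric; the field names and the equal weighting are illustrative assumptions.

```python
# Illustrative sketch only: the survey quantifies openness across model weights,
# source code, training data, and documentation. The rubric below is assumed,
# not taken from the paper.
from dataclasses import dataclass


@dataclass
class OpennessReport:
    weights_released: bool
    source_code_released: bool
    training_data_described: bool
    documentation_complete: bool

    def score(self) -> float:
        """Return a 0-1 openness score (equal weighting is an assumption)."""
        checks = [
            self.weights_released,
            self.source_code_released,
            self.training_data_described,
            self.documentation_complete,
        ]
        return sum(checks) / len(checks)


# Example: a model with public weights and code but no data card or full docs.
report = OpennessReport(True, True, False, False)
print(f"openness score: {report.score():.2f}")  # -> 0.50
```

A checklist of this kind makes the finding that "over half of the surveyed models lack publicly released weights or complete technical documentation" directly measurable and comparable across models.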
📝 Abstract
This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and emphasizes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.
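As a concrete illustration of the downstream tasks the survey evaluates, the sketch below runs sentiment analysis on Arabic text with an encoder-only model via the Hugging Face `pipeline` API. The specific checkpoint name is an assumption, not one the survey necessarily benchmarks; substitute any Arabic sentiment model available to you.

```python
# Minimal sketch of probing an Arabic encoder model on a downstream task
# (sentiment analysis). The model identifier is an assumed example checkpoint,
# not a recommendation from the survey.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment",  # assumed checkpoint
)

examples = [
    "الخدمة كانت ممتازة والتوصيل سريع",  # "The service was excellent and delivery was fast"
    "المنتج سيء جدا ولا أنصح به",        # "The product is very bad and I don't recommend it"
]

for text, result in zip(examples, classifier(examples)):
    print(f"{text} -> {result['label']} ({result['score']:.2f})")
```

Comparable pipelines for named entity recognition or question answering would swap the task string and checkpoint, which is how per-task comparisons across monolingual, bilingual, and multilingual Arabic LLMs are typically run.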