EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the underrepresentation of Middle Eastern and African cultural semantics and the strong Western bias prevalent in existing multimodal datasets, this work introduces EgMM-Corpus, a high-quality, Egypt-specific vision-language dataset. It comprises over 3,000 culturally authentic images annotated with 313 indigenous concepts spanning landmarks, cuisine, folklore, and traditional practices; every sample undergoes dual human verification for cultural fidelity and image–text alignment. Zero-shot classification on EgMM-Corpus reveals that CLIP achieves only 21.2% Top-1 accuracy (36.4% Top-5), exposing a systemic gap in mainstream foundation models' understanding of non-Western cultures. EgMM-Corpus thus establishes a regional cultural benchmark for multimodal AI evaluation and provides a rigorous, community-grounded standard for assessing and advancing cultural inclusivity in vision-language models.

📝 Abstract
Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.
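The zero-shot evaluation described above scores each image against text prompts for all 313 concepts and checks whether the true concept ranks among the top matches. A minimal sketch of how the reported Top-1/Top-5 accuracies are computed from an image–text similarity matrix (the toy scores and class counts below are illustrative, not from the paper):

```python
import numpy as np

def topk_accuracy(similarity: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of images whose true concept is among the k highest-scoring ones.

    similarity: (n_images, n_concepts) image-text similarity scores.
    labels:     (n_images,) index of the correct concept for each image.
    """
    # Indices of the k largest scores per row (order within the top-k is irrelevant).
    topk = np.argpartition(-similarity, kth=k - 1, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 4 images, 6 concepts.
sims = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],  # correct concept 0 ranked 1st
    [0.2, 0.5, 0.4, 0.1, 0.0, 0.0],  # correct concept 2 ranked 2nd
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # correct concept 0 ranked last
    [0.3, 0.6, 0.1, 0.0, 0.2, 0.0],  # correct concept 1 ranked 1st
])
labels = np.array([0, 2, 0, 1])

top1 = topk_accuracy(sims, labels, k=1)  # 0.5
top5 = topk_accuracy(sims, labels, k=5)  # 0.75
```

In the paper's setting, the similarity matrix would come from CLIP's image and text embeddings over the 313 concept prompts; the metric itself is model-agnostic.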
Problem

Research questions and friction points this paper is trying to address.

Addressing limited multimodal datasets for Middle Eastern cultures
Providing culturally authentic Egyptian vision-language dataset
Evaluating cultural bias in vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Designed a new data collection pipeline for multimodal datasets
Collected and manually validated over 3,000 culturally authentic Egyptian images
Evaluated CLIP's zero-shot performance on an Egyptian cultural benchmark
Mohamed Gamil, Abdelrahman Elsayed, Abdelrahman Lila, Ahmed Gad, Hesham Abdelgawad, Mohamed Aref
Department of Computer Science and Engineering, Egypt-Japan University of Science and Technology (E-JUST), Alexandria 21934, Egypt
Ahmed Fares
Assoc. Prof. of Computer Sci. and Eng., E-JUST
Neuroinformatics, Bioinformatics, Neuroscience, Machine learning, Deep learning