SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit insufficient cultural awareness of Saudi Arabia's linguistically diverse dialectal landscape and rich cultural context. Method: We introduce the first fine-grained, Saudi-specific cultural competence benchmark, covering five geographic regions and six cultural domains (e.g., cuisine, attire, festivals), incorporating open-ended, single-choice, and multiple-answer question formats, and distinguishing commonsense from domain-specialized knowledge. We propose a novel "geographic-cultural two-dimensional decoupled evaluation framework" to isolate regional expertise. Contribution/Results: Evaluation of five state-of-the-art models, including GPT-4 and Llama 3.3, reveals an average accuracy drop of more than 37% on region-specific questions and a 62% error rate on multiple-answer items, exposing critical deficits in localized cultural reasoning. The benchmark is constructed via expert annotation and cross-model consistency verification, advancing cultural assessment from generic to locale-specific evaluation and providing an essential empirical foundation for culturally adaptive LLM training.
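The paper does not reproduce its scoring procedure here, but the reported gap between single-choice and multiple-answer items is easiest to see under strict exact-match scoring. The sketch below is illustrative only: the function names (`score_answer`, `accuracy`) and the exact-set-match rule for multiple-answer items are assumptions, not the authors' published code.

```python
def score_answer(question_type, gold, predicted):
    """Score one benchmark item (hypothetical scorer, not the paper's code).

    gold/predicted are option labels (str) for single-choice items and
    collections of option labels for multiple-answer items.
    """
    if question_type == "single":
        return 1.0 if predicted == gold else 0.0
    if question_type == "multiple":
        # Strict exact-set match: the model must select every correct
        # option and no incorrect ones -- partial overlap earns no credit.
        return 1.0 if set(predicted) == set(gold) else 0.0
    raise ValueError(f"unknown question type: {question_type}")


def accuracy(items):
    """Mean score over a list of (question_type, gold, predicted) tuples."""
    scores = [score_answer(t, g, p) for t, g, p in items]
    return sum(scores) / len(scores)
```

Under this all-or-nothing rule, a model that recovers only some of the correct options on a multiple-answer item scores zero, which would explain why multiple-answer items drive the highest error rates:

```python
items = [
    ("single", "B", "B"),            # correct
    ("multiple", {"A", "C"}, {"A"}), # partially correct -> scored 0
]
accuracy(items)  # 0.5
```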

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing; however, they often struggle to accurately capture and reflect cultural nuances. This research addresses this challenge by focusing on Saudi Arabia, a country characterized by diverse dialects and rich cultural traditions. We introduce SaudiCulture, a novel benchmark designed to evaluate the cultural competence of LLMs within the distinct geographical and cultural contexts of Saudi Arabia. SaudiCulture is a comprehensive dataset of questions covering five major geographical regions, namely the West, East, South, North, and Center, along with general questions applicable across all regions. The dataset encompasses a broad spectrum of cultural domains, including food, clothing, entertainment, celebrations, and crafts. To ensure a rigorous evaluation, SaudiCulture includes questions of varying complexity in open-ended, single-choice, and multiple-choice formats, with some requiring multiple correct answers. Additionally, the dataset distinguishes between common cultural knowledge and specialized regional aspects. We conduct extensive evaluations on five LLMs, namely GPT-4, Llama 3.3, FANAR, Jais, and AceGPT, analyzing their performance across different question types and cultural contexts. Our findings reveal that all models experience significant performance declines when faced with highly specialized or region-specific questions, particularly those requiring multiple correct responses. Additionally, certain cultural categories are more easily identifiable than others, further highlighting inconsistencies in LLMs' cultural understanding. These results emphasize the importance of incorporating region-specific knowledge into LLM training to enhance cultural competence.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' cultural competence in Saudi Arabia's diverse contexts
Assessing LLMs' accuracy in regional dialects and cultural traditions
Identifying performance gaps in region-specific and specialized cultural questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SaudiCulture benchmark for cultural evaluation
Covers diverse Saudi regions and cultural domains
Tests LLMs on region-specific and complex questions