🤖 AI Summary
Low-resource languages are severely underrepresented in LLM evaluation, and existing multilingual benchmarks rely heavily on machine translation, introducing cultural distortions. Method: This paper introduces SinhalaMMLU—the first culturally adapted, multi-task language understanding benchmark for Sinhala—built from Sri Lanka’s national curriculum, covering six domains and 30 subjects with over 7,000 manually authored multiple-choice questions to eliminate translation bias and assess culture-specific knowledge. Contribution/Results: The authors systematically evaluate 26 state-of-the-art LLMs using accuracy as the primary metric. Claude 3.5 Sonnet (67%) and GPT-4o (62%) achieve the highest performance, yet overall scores remain low—particularly in the humanities and other culture-intensive domains. This work establishes an education-driven, locally grounded evaluation paradigm for low-resource languages, offering both a new benchmark and methodological insights for culturally equitable LLM assessment.
📝 Abstract
Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or Anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 Sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.
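The abstract reports average accuracy over multiple-choice questions as the headline metric. As a minimal sketch (not the paper's actual evaluation code), scoring reduces to comparing each model's chosen option against the gold option; all names and data below are illustrative:

```python
def mcq_accuracy(predictions, gold):
    """Fraction of items where the predicted option letter matches the gold one.

    `predictions` and `gold` are parallel lists of option labels (e.g. "A"-"D").
    Hypothetical helper, for illustration only.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must be the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example with four questions (illustrative, not real benchmark data):
preds = ["A", "C", "B", "D"]
answers = ["A", "B", "B", "D"]
print(mcq_accuracy(preds, answers))  # 3 of 4 correct -> 0.75
```

Per-domain breakdowns (e.g. the reported gap on Humanities questions) would apply the same computation to each subject's subset of items.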