SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-resource languages are severely underrepresented in LLM evaluation, and existing multilingual benchmarks rely heavily on machine translation, which introduces cultural distortions. Method: This paper introduces SinhalaMMLU, the first culturally adapted multitask language understanding benchmark for Sinhala, built from Sri Lanka's national curriculum and covering six domains and 30 subjects with over 7,000 manually authored multiple-choice questions, thereby avoiding translation bias and directly assessing culture-specific knowledge. Contribution/Results: The authors systematically evaluate 26 state-of-the-art LLMs using accuracy as the primary metric. Claude 3.5 Sonnet (67%) and GPT-4o (62%) achieve the highest performance, yet overall scores remain low, particularly in the Humanities and other culture-intensive domains. The work establishes an education-driven, locally grounded evaluation paradigm for low-resource languages, offering both a new benchmark and methodological insights for culturally equitable LLM assessment.

📝 Abstract
Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 Sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on low-resource Sinhala language understanding
Assessing culturally specific knowledge beyond automatic translation
Measuring model performance on Sri Lankan curriculum domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

First multiple-choice benchmark for Sinhala language
Over 7,000 curriculum-aligned questions across six domains and 30 subjects
Evaluates 26 LLMs on culturally specific knowledge
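The evaluation protocol above (accuracy on multiple-choice questions, reported overall and per domain) can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the item schema (`question`, `choices`, `answer`, `domain`) and the `predict` callable are hypothetical stand-ins for however SinhalaMMLU stores items and however a given LLM is prompted.

```python
from collections import defaultdict

def evaluate_mcq(items, predict):
    """Score multiple-choice items; return overall and per-domain accuracy.

    items:   list of dicts with keys "question", "choices", "answer" (gold
             option index), and "domain" -- a hypothetical schema.
    predict: callable (question, choices) -> chosen option index, standing
             in for a prompted LLM.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        pred = predict(item["question"], item["choices"])
        total[item["domain"]] += 1
        if pred == item["answer"]:
            correct[item["domain"]] += 1
    per_domain = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_domain

# Toy usage with a dummy predictor that always picks option 0.
items = [
    {"question": "Q1", "choices": ["a", "b"], "answer": 0, "domain": "Humanities"},
    {"question": "Q2", "choices": ["a", "b"], "answer": 1, "domain": "Humanities"},
]
overall, per_domain = evaluate_mcq(items, lambda q, c: 0)
```

Per-domain breakdowns like this are what surface the paper's key finding that models lag most in culture-intensive domains such as the Humanities.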