🤖 AI Summary
This study addresses the lack of a large-scale, MMLU-style language understanding benchmark for Urdu built from native educational sources. We introduce UrduMMLU, a non-translated Urdu benchmark collected from native Urdu MCQ banks and publicly available examination PDFs, comprising 26,431 multiple-choice questions across 26 subjects grouped into five broad domains. The exam-derived portion is labeled through dual human annotation with strict consensus filtering. We evaluate 30 large language models zero-shot under both English and Urdu prompts (60 evaluations in total) and run few-shot experiments on four open-source models. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy; no other model exceeds 85%, and the strongest open-source model trails by 7.79 and 8.92 points. Many models score 25 to 40 points lower on Urdu-centered Humanities subjects than on STEM, and few-shot prompting yields only modest gains.
📝 Abstract
Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.
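The bookkeeping behind these figures can be illustrated with a minimal sketch. It is not the authors' evaluation code: the record layout, the function names, the placeholder model names, and the toy predictions are all assumptions made for illustration. Only the domain labels "STEM" and "Humanities", the 30-model count, and the two prompt languages come from the abstract.

```python
from collections import defaultdict
from itertools import product


def accuracy_by_domain(records):
    """Return {domain: accuracy in percent}.

    `records` is an iterable of (domain, gold_option, predicted_option)
    tuples; this layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, gold, predicted in records:
        total[domain] += 1
        correct[domain] += int(gold == predicted)
    return {d: 100.0 * correct[d] / total[d] for d in total}


def evaluation_grid(models, prompt_languages=("English", "Urdu")):
    """Pair every model with every prompt language (one zero-shot run each)."""
    return list(product(models, prompt_languages))


# Toy predictions, invented purely to exercise the functions.
toy_records = [
    ("STEM", "A", "A"),
    ("STEM", "B", "B"),
    ("STEM", "C", "C"),
    ("STEM", "D", "A"),
    ("Humanities", "A", "B"),
    ("Humanities", "B", "B"),
    ("Humanities", "C", "D"),
    ("Humanities", "D", "D"),
]

scores = accuracy_by_domain(toy_records)
gap = scores["STEM"] - scores["Humanities"]  # STEM minus Humanities, in points

# 30 placeholder model names x 2 prompt languages = 60 zero-shot runs.
grid = evaluation_grid([f"model_{i}" for i in range(30)])
```

On the toy data the STEM score is 75.0 and the Humanities score is 50.0, so `gap` is 25.0 points, and `len(grid)` is 60, matching the count of zero-shot evaluations reported above.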