UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

📅 2026-06-05
🤖 AI Summary
This study addresses the lack of a large-scale, education-aligned multitask benchmark for Urdu grounded in native curricula. It introduces UrduMMLU, a non-translated Urdu evaluation dataset collected from native Urdu MCQ banks and publicly available examination PDFs, comprising 26,431 multiple-choice questions across 26 subjects grouped into five domains. The exam-derived portion is labeled through dual human annotation with strict consensus filtering. The authors evaluate 30 large language models under English and Urdu prompts, giving 60 zero-shot evaluations, and run few-shot experiments on four open-source models. Gemini-3.5-Flash performs best (90.20% and 90.34% accuracy), no other model exceeds 85%, and the strongest open-source model trails by 7.79 and 8.92 points. Many models score 25 to 40 points lower on Urdu-centered Humanities subjects than on STEM, and few-shot prompting yields only modest gains.
📝 Abstract
Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.
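As a rough illustration of the two mechanics the abstract mentions, consensus filtering of dual annotations and per-domain accuracy comparison, the following minimal Python sketch may help. It is not the authors' pipeline: the field names (`annotator_a`, `annotator_b`, `domain`), the function names, and the toy data are all hypothetical, and "strict consensus" is assumed here to mean that both annotators chose the same option.

```python
from collections import defaultdict


def consensus_filter(items):
    """Keep only items where both annotators agree (assumed meaning of
    'strict consensus'); the agreed label becomes the gold answer."""
    return [
        {**item, "answer": item["annotator_a"]}
        for item in items
        if item["annotator_a"] == item["annotator_b"]
    ]


def accuracy_by_domain(items, predictions):
    """Return accuracy in percent per domain for a dict of id -> predicted option."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["domain"]] += 1
        if predictions.get(item["id"]) == item["answer"]:
            correct[item["domain"]] += 1
    return {d: 100.0 * correct[d] / total[d] for d in total}


# Toy data, invented purely for illustration.
items = [
    {"id": 1, "domain": "STEM", "annotator_a": "A", "annotator_b": "A"},
    {"id": 2, "domain": "STEM", "annotator_a": "B", "annotator_b": "B"},
    {"id": 3, "domain": "Humanities", "annotator_a": "C", "annotator_b": "C"},
    {"id": 4, "domain": "Humanities", "annotator_a": "D", "annotator_b": "A"},  # disagreement
    {"id": 5, "domain": "Humanities", "annotator_a": "B", "annotator_b": "B"},
]
predictions = {1: "A", 2: "B", 3: "C", 5: "D"}

kept = consensus_filter(items)          # item 4 is dropped
scores = accuracy_by_domain(kept, predictions)
gap = scores["STEM"] - scores["Humanities"]
print(scores, gap)
```

The STEM-versus-Humanities gap reported in the abstract is a difference of this kind between per-domain accuracies, computed on the real benchmark rather than on toy data.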
Problem

Research questions and friction points this paper is trying to address.

Urdu
multilingual evaluation
MMLU-style benchmark
language understanding
educational context
Innovation

Methods, ideas, or system contributions that make the work stand out.

UrduMMLU
multilingual benchmark
native educational resources
human annotation
zero-shot evaluation