🤖 AI Summary
Evaluating large language models (LLMs) on low-resource languages such as Urdu remains challenging due to the lack of comprehensive, multimodal, and task-diverse benchmarks. Method: We systematically assess seven mainstream LLMs on 17 Urdu NLP tasks, spanning both text and speech, across 22 datasets and 13.8 hours of speech, under zero-shot settings, and compare them against task-specific state-of-the-art (SOTA) models. To enable rigorous evaluation, we introduce UrduBench: the first SOTA-aligned, multimodal, multi-task benchmark for Urdu. Contribution/Results: Our experiments indicate that the quality of a model's linguistic coverage, not its parameter count, is the primary determinant of zero-shot performance; notably, Llama 3.1-8B outperforms larger models (e.g., GPT-3.5) on several tasks, suggesting that with improved language coverage LLMs can surpass current task-specific SOTA methods. We further demonstrate the critical role of language-specific data in adapting LLMs to low-resource settings. This work quantifies the performance gap between general-purpose LLMs and specialized SOTA systems, while motivating a shift toward lightweight, high-coverage language modeling.
📝 Abstract
Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research by transitioning from language- and task-specific model pipelines to a single model adapted to a variety of tasks. However, the majority of existing multilingual NLP benchmarks for LLMs provide evaluation data in only a few languages, with little linguistic diversity. In addition, these benchmarks lack quality assessment against the respective state-of-the-art models. This study presents an in-depth examination of seven prominent LLMs: GPT-3.5-turbo, Llama 2-7B-Chat, Llama 3.1-8B, Bloomz 3B, Bloomz 7B1, Ministral-8B, and Whisper (large, medium, and small variants), across 17 tasks using 22 datasets and 13.8 hours of speech in a zero-shot setting, and compares and analyzes their performance against state-of-the-art (SOTA) models. Our experiments show that SOTA models currently outperform encoder-decoder models in the majority of Urdu NLP tasks under zero-shot settings. However, comparing Llama 3.1-8B with its predecessor Llama 2-7B-Chat, we can deduce that with improved language coverage, LLMs can surpass these SOTA models. Our results emphasize that models with fewer parameters but richer language-specific data, such as Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, on several tasks.
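The zero-shot evaluation protocol described above can be sketched as a simple loop: build a prompt containing only the task instruction and the input (no in-context examples), query the model, and score predictions against gold labels. This is a minimal illustration, not the paper's exact setup; the prompt template, label set, and mock model below are assumptions introduced for the example.

```python
def build_zero_shot_prompt(text: str, labels: list[str]) -> str:
    """Build a zero-shot classification prompt: instruction + input, no examples."""
    label_list = ", ".join(labels)
    return (
        f"Classify the sentiment of the following Urdu text as one of: {label_list}.\n"
        f"Text: {text}\n"
        "Answer:"
    )

def evaluate_zero_shot(model, dataset, labels):
    """Run the model on each example and report accuracy against gold labels."""
    correct = 0
    for text, gold in dataset:
        prediction = model(build_zero_shot_prompt(text, labels)).strip().lower()
        correct += int(prediction == gold)
    return correct / len(dataset)

# Stand-in "model" that always answers "positive"; in practice this would be
# a call to an actual LLM (e.g., an API or a local checkpoint).
def mock_model(prompt: str) -> str:
    return "positive"

# (text, gold label) pairs; texts elided for brevity.
dataset = [("...", "positive"), ("...", "negative")]
accuracy = evaluate_zero_shot(mock_model, dataset, ["positive", "negative"])
print(accuracy)  # 0.5
```

The same loop generalizes to the other text tasks by swapping the instruction and label set, and to speech tasks by replacing the text input with a transcription step.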