🤖 AI Summary
Existing multilingual benchmarks suffer from cultural bias, unimodal limitations, overreliance on multiple-choice formats, and insufficient coverage of extremely low-resource languages such as the endangered Irish language. To address these gaps, we introduce IRLBench: the first parallel English–Irish, multimodal, culturally embedded, open-ended reasoning benchmark, constructed from authentic 2024 Irish Leaving Certificate examination papers across 12 subjects and aligned with official grading rubrics to establish a fine-grained, education-grade evaluation framework. Key contributions include: (1) moving beyond unimodal and multiple-choice paradigms to support culturally grounded long-form text generation; (2) introducing a dual-axis automated evaluation metric assessing both factual correctness and linguistic quality; and (3) releasing the dataset, evaluation code, and an extensible interface for image-based prompting. Empirical evaluation shows that state-of-the-art LLMs produce valid Irish outputs less than 80% of the time and achieve only 55.8% accuracy in Irish (vs. 76.2% in English), highlighting a critical gap in low-resource multilingual assessment. IRLBench is publicly available to advance research on culturally robust multilingual AI.
📝 Abstract
Recent advances in Large Language Models (LLMs) have demonstrated promising knowledge and reasoning abilities, yet their performance in multilingual and low-resource settings remains underexplored. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only formats, rely on multiple-choice questions, and, more importantly, offer limited coverage of extremely low-resource languages. To address these gaps, we introduce IRLBench, presented in parallel English and Irish, a language classified as definitely endangered by UNESCO. Our benchmark consists of 12 representative subjects drawn from the 2024 Irish Leaving Certificate exams, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, it supports a comprehensive evaluation of not only correctness but also language fidelity. Our extensive experiments with leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish: models produce valid Irish responses less than 80% of the time, and the best-performing model answers correctly 55.8% of the time in Irish compared to 76.2% in English. We release IRLBench (https://huggingface.co/datasets/ReliableAI/IRLBench) and an accompanying evaluation codebase (https://github.com/ReML-AI/IRLBench) to enable future research on robust, culturally aware multilingual AI development.