🤖 AI Summary
Existing natural question-answering (QA) benchmarks lack native-speaker-driven design and region-specific cultural alignment, hindering fine-grained evaluation and adaptation of large language models (LLMs) along cultural and linguistic dimensions. To address this, we propose NativQA, a language-agnostic, scalable framework, and introduce MultiNativQA, the first native-user-driven, multilingual, regionally and culturally aligned natural QA benchmark. It spans seven languages (including extremely low-resource ones), nine geographic regions, and 18 thematic domains, comprising ~64k high-quality samples. Data collection integrates cross-regional native-user queries, expert annotation, cultural-sensitivity validation, and multilingual consistency alignment. Systematic evaluations of leading open- and closed-source LLMs on MultiNativQA reveal, for the first time, substantial performance disparities across regional cultural contexts. Both the benchmark dataset and the implementation code are fully open-sourced.
📝 Abstract
Natural question answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs) and ensuring their effectiveness in real-world applications. Although numerous QA datasets have been developed, and some related work has been done in parallel, there is a notable lack of both a framework and large-scale, region-specific datasets built from queries posed by native users in their own languages. This gap hinders effective benchmarking and the development of models fine-tuned for regional and cultural specificities. In this study, we propose NativQA, a scalable, language-independent framework for seamlessly constructing culturally and regionally aligned QA datasets in native languages for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by building MultiNativQA, a multilingual natural QA dataset consisting of ~64k manually annotated QA pairs in seven languages, ranging from high- to extremely low-resource, based on queries from native speakers in nine regions covering 18 topics. We benchmark open- and closed-source LLMs on the MultiNativQA dataset, and make the dataset (https://huggingface.co/datasets/QCRI/MultiNativQA) and experimental scripts (https://gitlab.com/nativqa/multinativqa) publicly available to the community.