🤖 AI Summary
Existing multilingual LLM factuality evaluations suffer from narrow knowledge coverage, subjective answer annotation, and temporal instability. Method: We propose KoLasSimpleQA, the first fine-grained benchmark for multilingual factual memory and self-awareness, covering nine languages and introducing a "general + language-specific" dual-domain design. Questions follow a short-form factual construction paradigm grounded in single-knowledge-point queries, objectively unique answers, and temporal stability. Knowledge-guided fine-grained question generation and LLM-as-judge automated evaluation support a two-dimensional framework balancing breadth (multilinguality) and depth (cross-domain coverage). Contribution/Results: Experiments reveal systematic disparities in multilingual factuality across mainstream LLMs and large reasoning models, with marked gaps between the two domains in accuracy, ranking, calibration, and robustness. KoLasSimpleQA thus offers a reproducible benchmark for delineating multilingual capability boundaries and guiding model optimization.
📝 Abstract
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we constructed the question set around four properties: single-knowledge-point coverage, absolute objectivity, unique answers, and temporal stability. These properties enable efficient evaluation with the LLM-as-judge paradigm, testing both the LLMs' factual memory and their self-awareness ("know what they don't know"). KoLasSimpleQA extends existing research in two key dimensions: (1) Breadth (multilingual coverage): it includes 9 languages, supporting evaluation of global applicability. (2) Depth (dual-domain design): it covers both a general domain (global facts) and a language-specific domain (e.g., history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream models, including traditional LLMs and emerging Large Reasoning Models. Results show significant differences between the two domains in performance metrics, ranking, calibration, and robustness, highlighting the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at https://github.com/opendatalab/KoLasSimpleQA .
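The LLM-as-judge evaluation described above can be sketched as follows. This is a minimal, illustrative Python sketch, not the authors' actual implementation: the prompt template, label set, and metric names are assumptions based on the abstract's description (unique answers, factual memory vs. self-awareness), and the call to a real judge model is omitted.

```python
# Hypothetical sketch of LLM-as-judge grading for single-answer factual QA.
# The prompt, labels, and metrics below are illustrative assumptions.

JUDGE_PROMPT = """You are grading a factual QA item.
Question: {question}
Gold answer: {gold}
Model answer: {answer}
Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."""

def build_judge_prompt(question: str, gold: str, answer: str) -> str:
    """Fill the judge prompt; the reply would come from a judge LLM."""
    return JUDGE_PROMPT.format(question=question, gold=gold, answer=answer)

def parse_verdict(judge_reply: str) -> str:
    """Reduce the judge's free-form reply to one of three labels.

    Checked longest-first because "CORRECT" is a substring of "INCORRECT".
    """
    reply = judge_reply.strip().upper()
    for label in ("NOT_ATTEMPTED", "INCORRECT", "CORRECT"):
        if label in reply:
            return label
    return "INCORRECT"  # conservative default for unparseable replies

def score(verdicts: list[str]) -> dict[str, float]:
    """Overall accuracy plus accuracy on attempted questions only.

    The second metric separates factual memory from self-awareness:
    a model that declines when unsure is not penalized on it.
    """
    n = len(verdicts)
    correct = sum(v == "CORRECT" for v in verdicts)
    attempted = sum(v != "NOT_ATTEMPTED" for v in verdicts)
    return {
        "accuracy": correct / n if n else 0.0,
        "correct_given_attempted": correct / attempted if attempted else 0.0,
    }
```

Because every question has a single, objectively unique answer, the judge's output collapses to a three-way label, which is what makes this automated grading reliable at scale.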