FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models

📅 2025-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Persian large language models (LLMs) lack systematic, culturally grounded evaluation benchmarks. Method: The paper introduces FarsEval-PKBETS, a culture-adapted benchmark for Persian LLMs that is part of the FarsEval project. It covers ten domains and tasks, including medicine, law, religion, and ethics and bias, with 4,000 items in multiple formats (multiple-choice, short-answer, descriptive). The items incorporate linguistic, cultural, and local considerations relevant to the Persian language and Iran. Contribution/Results: Three models (Llama3-70B, PersianMind, and Dorna) were evaluated on the benchmark, and their average accuracy was below 50%, meaning they gave fully correct answers to fewer than half of the questions. The results indicate that current models remain far from solving the benchmark and help address the scarcity of rigorous evaluation resources for Persian.

📝 Abstract

Research on evaluating and analyzing large language models (LLMs) has been extensive for resource-rich languages such as English, yet their performance in languages such as Persian has received considerably less attention. This paper introduces the FarsEval-PKBETS benchmark, a subset of the FarsEval project for evaluating large language models in Persian. The benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer, and descriptive responses. It covers a wide range of domains and tasks, including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights. The benchmark incorporates linguistic, cultural, and local considerations relevant to the Persian language and Iran. To ensure the questions are challenging for current LLMs, three models (Llama3-70B, PersianMind, and Dorna) were evaluated using this benchmark. Their average accuracy was below 50%, meaning they provided fully correct answers to fewer than half of the questions. These results indicate that current language models are still far from being able to solve this benchmark.
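The accuracy figure above counts an item only when the answer is fully correct. A minimal sketch of that kind of scoring follows, assuming each graded item has already been reduced to a boolean verdict. The `ItemResult` record, its field names, the format labels, and the toy data are all hypothetical illustrations; they are not the paper's data schema, grading procedure, or results.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    # Hypothetical record for one graded benchmark item.
    item_id: str
    task_format: str      # assumed labels: "multiple_choice", "short_answer", "descriptive"
    fully_correct: bool   # True only if the whole answer was judged correct


def accuracy(results):
    """Fraction of items whose answer was fully correct."""
    if not results:
        raise ValueError("cannot compute accuracy over zero items")
    return sum(r.fully_correct for r in results) / len(results)


def accuracy_by_format(results):
    """Accuracy computed separately for each question format."""
    grouped = {}
    for r in results:
        grouped.setdefault(r.task_format, []).append(r)
    return {fmt: accuracy(items) for fmt, items in grouped.items()}


# Made-up verdicts for illustration only; not taken from the paper.
toy = [
    ItemResult("q1", "multiple_choice", True),
    ItemResult("q2", "multiple_choice", False),
    ItemResult("q3", "short_answer", False),
    ItemResult("q4", "descriptive", True),
    ItemResult("q5", "descriptive", False),
]

overall = accuracy(toy)
per_format = accuracy_by_format(toy)
```

Partially correct answers contribute nothing under this all-or-nothing rule, which is one reason a benchmark with descriptive questions can yield low headline accuracy.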

Problem

Research questions and friction points this paper is trying to address.

Persian LLM evaluation lacks benchmarks comparable to those available for English

Introducing FarsEval-PKBETS with 4000 diverse Persian questions

Current LLMs score below 50% on this challenging benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FarsEval-PKBETS Persian benchmark

Includes 4000 diverse questions and answers

Evaluates models on culturally and linguistically relevant Persian content
