FarsEval-PKBETS: A new diverse benchmark for evaluating Persian large language models

📅 2025-04-20
🤖 AI Summary
Persian large language models (LLMs) lack systematic, culturally grounded evaluation benchmarks. Method: The paper introduces FarsEval-PKBETS, a culture-adapted benchmark for Persian LLMs and a subset of the FarsEval project. It covers ten domains and tasks, including medicine, law, religion, and ethics, with 4,000 items in multiple formats (multiple-choice, short-answer, descriptive), and it incorporates Persian linguistic features, Iranian cultural context, and local knowledge. Contribution/Results: Three models (Llama3-70B, PersianMind, and Dorna) were evaluated on the benchmark; their average accuracy was below 50%, meaning fewer than half of the questions received fully correct answers. The results indicate that current models remain far from solving the benchmark, which helps address the scarcity of rigorous evaluation resources for Persian.

📝 Abstract
Research on evaluating and analyzing large language models (LLMs) has been extensive for resource-rich languages such as English, yet their performance in languages such as Persian has received considerably less attention. This paper introduces the FarsEval-PKBETS benchmark, a subset of the FarsEval project for evaluating large language models in Persian. The benchmark consists of 4000 questions and answers in various formats, including multiple choice, short answer, and descriptive responses. It covers a wide range of domains and tasks, including medicine, law, religion, Persian language, encyclopedic knowledge, human preferences, social knowledge, ethics and bias, text generation, and respecting others' rights. The benchmark incorporates linguistic, cultural, and local considerations relevant to the Persian language and Iran. To ensure the questions are challenging for current LLMs, three models (Llama3-70B, PersianMind, and Dorna) were evaluated using this benchmark. Their average accuracy was below 50%, meaning they provided fully correct answers to fewer than half of the questions. These results indicate that current language models are still far from being able to solve this benchmark.
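The accuracy figure in the abstract counts only fully correct answers. As a rough illustration of that kind of strict scoring, here is a minimal Python sketch. It is not the authors' evaluation code: the function name `strict_accuracy`, the model labels, and the per-item judgments are all hypothetical placeholders, not data or results from the paper.

```python
from statistics import mean


def strict_accuracy(graded):
    """Return the fraction of items judged fully correct.

    `graded` is a list of booleans, one per benchmark item
    (True = fully correct, False = anything else).
    """
    return sum(1 for ok in graded if ok) / len(graded)


# Hypothetical per-item judgments for three placeholder models.
# These values are invented for illustration only.
judgments = {
    "model_a": [True, False, False, True, False],
    "model_b": [False, False, True, False, False],
    "model_c": [True, True, False, False, False],
}

# Strict accuracy per model, then the unweighted average across models.
per_model = {name: strict_accuracy(g) for name, g in judgments.items()}
average = mean(per_model.values())

for name, acc in per_model.items():
    print(f"{name}: {acc:.2f}")
print(f"average: {average:.2f}")
```

Under this scoring, partially correct short-answer or descriptive responses earn no credit, which is one plausible reading of "fully correct answers" in the abstract.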
Problem

Research questions and friction points this paper is trying to address.

Persian LLM evaluation lacks benchmarks comparable to those available for English
Introducing FarsEval-PKBETS with 4000 diverse Persian questions
Current LLMs score below 50% on this challenging benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FarsEval-PKBETS Persian benchmark
Includes 4000 diverse questions and answers
Evaluates models with cultural and linguistic relevance
Mehrnoush Shamsfard
Associate Professor of Computer Engineering, Shahid Beheshti University
Artificial Intelligence, Natural Language Processing, Persian NLP, Ontology, Semantic Web
Zahra Saaberi
SBU NLP Lab, Shahid Beheshti University
Artificial Intelligence, Natural Language Processing
Mostafa Karimi manesh
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Seyed Mohammad Hossein Hashemi
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Zahra Vatankhah
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Motahareh Ramezani
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Niki Pourazin
Faculty of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
Tara Zare
Shahid Beheshti University
Artificial Intelligence, Natural Language Processing, Deep Learning, Machine Learning, Big Data
Maryam Azimi
Faculty of Law, Qom University, Qom, Iran
Sarina Chitsaz
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Sama Khoraminejad
Faculty of Medicine, Tehran Medical Sciences, Islamic Azad University, Tehran, Iran
Morteza Mahdavi Mortazavi
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Mohammad Mahdi Chizari
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Sahar Maleki
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Seyed Soroush Majd
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Mostafa Masumi
Computer Engineering Department, Sharif University of Technology, Tehran, Iran
Sayed Ali Musavi Khoeini
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Amir Mohseni
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Sogol Alipour
NLP Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran