🤖 AI Summary
Hallucination evaluation for low-resource languages like Persian remains underexplored, hindering reliable assessment of large language models (LLMs) in such settings.
Method: We introduce PerHalluEval, the first dynamic hallucination benchmark for Persian, built via an LLM-driven three-stage pipeline: question generation, answer/summary generation, and hallucination annotation. The pipeline integrates human verification, culturally sensitive labeling, and token log-probability filtering to select the most deceptive hallucinated samples. Additionally, we incorporate the original document as external knowledge in the question-answering and summarization tasks to mitigate hallucination.
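The log-probability filtering step can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual implementation: the `candidates` structure, the `logprobs` field, and both helper names are hypothetical, and it assumes the generating LLM's API exposes per-token log probabilities. The idea is that a hallucinated candidate whose tokens the model assigned a high average log probability reads as more fluent, and is therefore a more deceptive benchmark instance.

```python
# Hypothetical sketch of log-probability filtering (not the paper's code).
# Each candidate hallucinated answer carries the per-token log probabilities
# reported by the generating LLM.

def mean_logprob(token_logprobs):
    """Average log probability per token; higher means more fluent to the model."""
    return sum(token_logprobs) / len(token_logprobs)

def most_plausible(candidates):
    """Pick the hallucinated candidate with the highest average token
    log probability, i.e. the most believable (deceptive) one."""
    return max(candidates, key=lambda c: mean_logprob(c["logprobs"]))

candidates = [
    {"text": "hallucinated answer A", "logprobs": [-0.9, -1.4, -2.1]},
    {"text": "hallucinated answer B", "logprobs": [-0.2, -0.3, -0.5]},
]
print(most_plausible(candidates)["text"])  # answer B: highest mean log-prob
```

In practice such a filter would run over many LLM-generated hallucinations per question and keep only the top-scoring ones for human verification.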
Contribution/Results: PerHalluEval establishes the first Persian-specific hallucination evaluation framework. Experimental results across 12 mainstream LLMs reveal consistently poor hallucination robustness, with Persian-specialized models showing no significant advantage over multilingual ones. Empirical analysis confirms that integrating external document knowledge partially suppresses hallucination, offering a practical mitigation strategy for low-resource language applications.
📝 Abstract
Hallucination is a persistent issue affecting all large language models (LLMs), particularly in low-resource languages such as Persian. PerHalluEval (Persian Hallucination Evaluation) is the first dynamic hallucination evaluation benchmark tailored to the Persian language. Our benchmark leverages a three-stage LLM-driven pipeline, augmented with human validation, to generate plausible answers and summaries for QA and summarization tasks, focusing on detecting extrinsic and intrinsic hallucinations. Moreover, we used the log probabilities of generated tokens to select the most believable hallucinated instances. In addition, we engaged human annotators to highlight Persian-specific contexts in the QA dataset in order to evaluate LLMs' performance on content specific to Persian culture. Our evaluation of 12 LLMs, including open- and closed-source models, using PerHalluEval revealed that models generally struggle to detect hallucinated Persian text. We showed that providing external knowledge, i.e., the original document for the summarization task, can partially mitigate hallucination. Furthermore, we found no significant difference in hallucination between LLMs specifically trained for Persian and other models.