PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the reliability of large language models (LLMs) in high-stakes medical applications for low-resource languages, specifically Persian. To address the scarcity of validated evaluation resources, we introduce the first large-scale, expert-validated Persian medical multiple-choice question dataset. We propose a cross-lingual (Persian/English) zero-shot and chain-of-thought evaluation framework, augmented by bilingual consistency verification and translation impact attribution analysis. Our work systematically exposes performance gaps in Persian medical LLMs, revealing that culturally grounded clinical context can substantially improve accuracy, while scaling model parameters alone cannot compensate for the lack of joint language-domain fine-tuning. Experiments show GPT-4.1 achieves 83.3% and 80.7% accuracy on the Persian and English test sets, respectively, whereas the Persian-specific model Dorna attains only 35.9%, underscoring the roles of translation non-equivalence and contextual sensitivity in medical LLM reliability.
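
The summary mentions a zero-shot vs. chain-of-thought (CoT) evaluation framework; the sketch below illustrates what that contrast can look like for a single multiple-choice item. The prompt wording, the item layout, and the `query_model` helper are hypothetical illustrations, not the paper's actual templates.

```python
# Hypothetical sketch of zero-shot vs. chain-of-thought (CoT) prompting for
# one multiple-choice medical question. Prompt wording and query_model are
# illustrative assumptions, not the paper's exact setup.

def build_prompt(question: str, options: dict, cot: bool) -> str:
    opts = "\n".join(f"{key}) {text}" for key, text in options.items())
    instruction = (
        "Think through the clinical reasoning step by step, then give the "
        "letter of the correct option on the last line."
        if cot
        else "Answer with only the letter of the correct option."
    )
    return f"{question}\n{opts}\n{instruction}"

def query_model(prompt: str) -> str:
    """Placeholder for a chat-completion API call to the model under test."""
    raise NotImplementedError

# One item, in Persian or its English translation (contents elided here).
item = {
    "question": "...",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
}
zero_shot_prompt = build_prompt(item["question"], item["options"], cot=False)
cot_prompt = build_prompt(item["question"], item["options"], cot=True)
```

Running the same item through both prompt variants, in both languages, is what enables the bilingual consistency and translation-impact comparisons described above.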

📝 Abstract
Large Language Models (LLMs) have achieved remarkable performance on a wide range of NLP benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale, expert-validated dataset of multiple-choice Persian medical questions, designed to evaluate LLMs across both Persian and English. We benchmark over 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.3% accuracy in Persian and 80.7% in English, while Persian fine-tuned models such as Dorna underperform significantly (e.g., 35.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, Persian responses are sometimes more accurate due to cultural and clinical contextual cues. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating multilingual and culturally grounded medical reasoning in LLMs. The PersianMedQA dataset can be accessed at: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA
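
Since the abstract points to the dataset on the Hugging Face Hub, a minimal loading sketch follows. It uses only the repo ID given above and inspects the splits and schema at runtime rather than assuming any particular split or column names.

```python
# Minimal sketch of pulling PersianMedQA from the Hugging Face Hub. The repo
# ID comes from the abstract; split and column names are not assumed here,
# so inspect the printed structure before hard-coding them in eval code.
from datasets import load_dataset

ds = load_dataset("MohammadJRanjbar/PersianMedQA")
print(ds)                    # shows the available splits and their columns

first_split = next(iter(ds.values()))
print(first_split[0])        # inspect one record's schema
```
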
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in the Persian medical domain
Assessing reliability in low-resource medical contexts
Analyzing multilingual medical reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-validated Persian medical QA dataset
Benchmarked 40+ models in zero-shot and CoT
Analyzed translation impact on model accuracy (see the sketch after this list)
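
A rough way to operationalize the translation-impact and bilingual-consistency analyses named above: score a model's option choices on the Persian items and on their English translations, then compare accuracies and per-item agreement. The variable names and data layout below are illustrative assumptions, not the authors' code.

```python
# Illustrative helpers for translation-impact analysis: compare a model's
# correctness on Persian items vs. their English translations. Data layout
# (lists of option letters such as ["A", "C", ...]) is an assumption.

def accuracy(preds: list, golds: list) -> float:
    """Fraction of items answered with the reference option letter."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def bilingual_consistency(preds_fa: list, preds_en: list) -> float:
    """Fraction of items where the model picks the same option in both languages."""
    return sum(a == b for a, b in zip(preds_fa, preds_en)) / len(preds_fa)

# Usage, given predictions on both language versions and gold answers:
#   gap = accuracy(preds_en, golds) - accuracy(preds_fa, golds)
#   agree = bilingual_consistency(preds_fa, preds_en)
```
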
Mohammad Javad Ranjbar Kalahroodi
School of Electrical and Computer Engineering, University of Tehran, Iran
Amirhossein Sheikholselami
School of Electrical and Computer Engineering, University of Tehran, Iran
Sepehr Karimi
School of Electrical and Computer Engineering, University of Tehran, Iran
Sepideh Ranjbar Kalahroodi
Shahid Beheshti University of Medical Sciences, Iran
Heshaam Faili
Full Professor, University of Tehran
Natural Language Processing, Social Network
A. Shakery
School of Electrical and Computer Engineering, University of Tehran, Iran; Institute for Research in Fundamental Sciences (IPM), Tehran, Iran