PersianMedQA: Language-Centric Evaluation of LLMs in the Persian Medical Domain

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the reliability of large language models (LLMs) in high-stakes medical applications for low-resource languages, specifically Persian. To address the scarcity of validated evaluation resources, we introduce the first large-scale, expert-validated Persian medical multiple-choice question dataset. We propose a cross-lingual (Persian/English) zero-shot and chain-of-thought evaluation framework, augmented by bilingual consistency verification and translation impact attribution analysis. Our work systematically exposes performance gaps in Persian medical LLMs, revealing that culturally grounded clinical context can substantially improve accuracy, while scaling model parameters alone cannot compensate for the lack of joint language-domain fine-tuning. Experiments show GPT-4.1 achieves 83.3% and 80.7% accuracy on the Persian and English test sets, respectively, whereas the Persian-specific model Dorna attains only 35.9%, underscoring the roles of translation non-equivalence and contextual sensitivity in medical LLM reliability.
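
The summary mentions a zero-shot vs. chain-of-thought (CoT) evaluation framework; the sketch below illustrates what that contrast can look like for a single multiple-choice item. The prompt wording, the item layout, and the `query_model` helper are hypothetical illustrations, not the paper's actual templates.

```python
# Hypothetical sketch of zero-shot vs. chain-of-thought (CoT) prompting for
# one multiple-choice medical question. Prompt wording and query_model are
# illustrative assumptions, not the paper's exact setup.

def build_prompt(question: str, options: dict, cot: bool) -> str:
    opts = "\n".join(f"{key}) {text}" for key, text in options.items())
    instruction = (
        "Think through the clinical reasoning step by step, then give the "
        "letter of the correct option on the last line."
        if cot
        else "Answer with only the letter of the correct option."
    )
    return f"{question}\n{opts}\n{instruction}"

def query_model(prompt: str) -> str:
    """Placeholder for a chat-completion API call to the model under test."""
    raise NotImplementedError

# One item, in Persian or its English translation (contents elided here).
item = {
    "question": "...",
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
}
zero_shot_prompt = build_prompt(item["question"], item["options"], cot=False)
cot_prompt = build_prompt(item["question"], item["options"], cot=True)
```

Running the same item through both prompt variants, in both languages, is what enables the bilingual consistency and translation-impact comparisons described above.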

📝 Abstract
Large Language Models (LLMs) have achieved remarkable performance on a wide range of NLP benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale, expert-validated dataset of multiple-choice Persian medical questions, designed to evaluate LLMs across both Persian and English. We benchmark over 40 state-of-the-art models, including general-purpose, Persian fine-tuned, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-source general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.3% accuracy in Persian and 80.7% in English, while Persian fine-tuned models such as Dorna underperform significantly (e.g., 35.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, Persian responses are sometimes more accurate due to cultural and clinical contextual cues. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating multilingual and culturally grounded medical reasoning in LLMs. The PersianMedQA dataset can be accessed at: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA
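
Since the abstract points to the dataset on the Hugging Face Hub, a minimal loading sketch follows. It uses only the repo ID given above and inspects the splits and schema at runtime rather than assuming any particular split or column names.

```python
# Minimal sketch of pulling PersianMedQA from the Hugging Face Hub. The repo
# ID comes from the abstract; split and column names are not assumed here,
# so inspect the printed structure before hard-coding them in eval code.
from datasets import load_dataset

ds = load_dataset("MohammadJRanjbar/PersianMedQA")
print(ds)                    # shows the available splits and their columns

first_split = next(iter(ds.values()))
print(first_split[0])        # inspect one record's schema
```
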
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in the Persian medical domain
Assessing reliability in low-resource medical contexts
Analyzing multilingual medical reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-validated Persian medical QA dataset
Benchmarked 40+ models in zero-shot and CoT
Analyzed translation impact on model accuracy (see the sketch after this list)
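
A rough way to operationalize the translation-impact and bilingual-consistency analyses named above: score a model's option choices on the Persian items and on their English translations, then compare accuracies and per-item agreement. The variable names and data layout below are illustrative assumptions, not the authors' code.

```python
# Illustrative helpers for translation-impact analysis: compare a model's
# correctness on Persian items vs. their English translations. Data layout
# (lists of option letters such as ["A", "C", ...]) is an assumption.

def accuracy(preds: list, golds: list) -> float:
    """Fraction of items answered with the reference option letter."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def bilingual_consistency(preds_fa: list, preds_en: list) -> float:
    """Fraction of items where the model picks the same option in both languages."""
    return sum(a == b for a, b in zip(preds_fa, preds_en)) / len(preds_fa)

# Usage, given predictions on both language versions and gold answers:
#   gap = accuracy(preds_en, golds) - accuracy(preds_fa, golds)
#   agree = bilingual_consistency(preds_fa, preds_en)
```
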
Mohammad Javad Ranjbar Kalahroodi
School of Electrical and Computer Engineering, University of Tehran, Iran
Amirhossein Sheikholselami
School of Electrical and Computer Engineering, University of Tehran, Iran
Sepehr Karimi
School of Electrical and Computer Engineering, University of Tehran, Iran
Sepideh Ranjbar Kalahroodi
Shahid Beheshti University of Medical Sciences, Iran
Heshaam Faili
Full Professor, University of Tehran
Natural Language Processing, Social Network
A. Shakery
School of Electrical and Computer Engineering, University of Tehran, Iran; Institute for Research in Fundamental Sciences (IPM), Tehran, Iran