Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

📅 2026-06-10

🤖 AI Summary

Evaluations of medical large language models that rely on multiple-choice question answering (MCQA) are susceptible to guessing and answer bias, both of which can inflate estimates of clinical reasoning ability. To mitigate this, the authors introduce a more rigorous benchmark based on Polish medical licensing examinations. It adds over 15,000 new questions and two additional clinical domains, and applies four structural modifications designed to reduce MCQA-specific biases. The benchmark supports cross-lingual evaluation and data contamination detection, and was used to assess 21 prominent large language models. Under this harder setting, the top-performing model, Qwen3.5-122B, drops by 28.4 percentage points on the English exams and 31 percentage points on the Polish exams, which indicates that standard MCQA scores are not reliable indicators of medical competence.

📝 Abstract

Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 pp on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.
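
The guessing and answer-bias concerns, and the percentage-point (pp) drops reported above, can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than the authors' code: the function names are hypothetical, the number of answer options is a free parameter, and none of the inputs are taken from the paper.

```python
from typing import Sequence


def chance_corrected(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0.

    `accuracy` is a fraction in [0, 1]; `num_options` is the number of answer
    choices per question (an assumption supplied by the caller).
    """
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


def pp_drop(standard_acc: float, harder_acc: float) -> float:
    """Drop in percentage points between two accuracies given as fractions."""
    return (standard_acc - harder_acc) * 100.0


def constant_answer_accuracy(gold: Sequence[str], choice: str) -> float:
    """Accuracy of a degenerate 'model' that always picks the same option.

    A high value means the answer key is skewed toward `choice`, so a model
    with a matching positional preference is rewarded without any reasoning.
    """
    return sum(g == choice for g in gold) / len(gold)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    print(chance_corrected(0.60, num_options=5))
    print(pp_drop(0.90, 0.60))
    print(constant_answer_accuracy(["A", "B", "A", "C"], "A"))
```

A percentage-point drop is an absolute difference between two accuracies, not a relative change, so a 30 pp drop from 90% leaves 60% rather than 63%.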

Problem

Research questions and friction points this paper is trying to address.

large language models

medical competence

multiple-choice question answering

evaluation bias

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

medical benchmark

multiple-choice question answering

language model evaluation