Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

📅 2026-06-10
🤖 AI Summary
This study addresses the susceptibility of existing multiple-choice question answering (MCQA)–based evaluations of medical large language models to guessing and answer bias, which often inflate estimates of true clinical reasoning capabilities. To mitigate these limitations, the authors introduce a more rigorous benchmark based on Polish medical licensing examinations, incorporating over 15,000 new questions spanning two additional clinical domains and implementing four structural enhancements designed to reduce inherent MCQA biases. The benchmark facilitates cross-lingual evaluation and data contamination detection, and was used to systematically assess 21 prominent large language models. Results reveal that under this more challenging setting, the top-performing model, Qwen3.5-122B, exhibits performance drops of 28.4 and 31 percentage points on the English and Polish exams, respectively, underscoring the inadequacy of standard MCQA scores as reliable indicators of genuine medical competence.
📝 Abstract
Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 pp on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.
Problem

Research questions and friction points this paper is trying to address.

large language models
medical competence
multiple-choice question answering
evaluation bias
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

medical benchmark
multiple-choice question answering
language model evaluation
reasoning assessment
bias mitigation
Authors
Antoni Lasik
NASK National Research Institute
Jakub Pokrywka
Adam Mickiewicz University
Machine Learning
Łukasz Grzybowski
Association for Research and Applications of Artificial Intelligence
machine learning, artificial intelligence, data, data engineering
Jeremi Ignacy Kaczmarek
Adam Mickiewicz University, Poznań University of Medical Sciences, T. Marciniak Lower Silesian Specialist Hospital
Gabriela Korzańska
Poznań University of Medical Sciences
Janusz Świeczkowski-Feiz
Centre of Postgraduate Medical Education, Poland, Medical University of Warsaw
Oskar Pastuszek
T. Marciniak Lower Silesian Specialist Hospital
Paulina Hoffman
Medical University of Warsaw
Jakub Tomasz Dąbrowski
Centre of Postgraduate Medical Education, Poland
Wojciech Kusa
NASK National Research Institute
Natural Language Processing, Information Retrieval, Machine Learning, LLMs