Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

📅 2025-03-13
🤖 AI Summary
This work challenges the validity of multiple-choice question (MCQ) benchmarks for assessing the clinical reasoning of large language models (LLMs), arguing that the MCQ format itself can artificially inflate performance. To test this, the authors introduce FreeMedQA, a benchmark of free-response questions paired with MCQs, and evaluate GPT-4o, GPT-3.5, and Llama-3-70B-instruct. Using a format-masking ablation in which parts of the question stem are iteratively hidden, they find that LLMs score on average 39.43% higher on MCQs than on the corresponding free-response questions (p = 1.3 × 10⁻⁵). Under full masking, average MCQ accuracy remains 6.70% above random chance (p = 0.002), with GPT-4o reaching 37.34%, indicating substantial reliance on structural cues in the answer options rather than comprehension of the question. These results suggest that MCQ-based evaluations overestimate LLMs' clinical reasoning, and that free-response assessment offers a more rigorous and reliable alternative.

📝 Abstract
The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10⁻⁵), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings of medical MCQ benchmarks, which overestimate the capabilities of LLMs in medicine, and, more broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM performance on medical MCQs versus free-response questions.
Identifying overestimation of LLM capabilities in medical benchmarks.
Exploring improvements in human and machine assessments using free-response formats.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed FreeMedQA benchmark for free-response questions.
Evaluated LLMs using paired MCQs and free-response formats.
Conducted masking study to isolate MCQ format impact.
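The masking ablation described above can be illustrated with a toy harness (a hypothetical sketch, not the authors' code — `mask_stem`, `mcq_accuracy`, and the question format are assumptions): mask a chosen fraction of the words in each question stem, then measure the MCQ accuracy of any answering function against the chance floor.

```python
import random


def mask_stem(stem: str, fraction: float, mask_token: str = "[MASKED]",
              seed: int = 0) -> str:
    """Replace a given fraction of the stem's words with a mask token.

    At fraction=1.0 the entire stem is hidden, so any above-chance MCQ
    accuracy must come from cues in the answer options alone.
    """
    words = stem.split()
    rng = random.Random(seed)
    n_mask = round(len(words) * fraction)
    for i in rng.sample(range(len(words)), n_mask):
        words[i] = mask_token
    return " ".join(words)


def mcq_accuracy(questions, answer_fn) -> float:
    """Fraction of questions answer_fn gets right.

    Each question is a dict with 'stem', 'options', and 'answer' keys;
    answer_fn maps (stem, options) to the chosen option.
    """
    correct = sum(answer_fn(q["stem"], q["options"]) == q["answer"]
                  for q in questions)
    return correct / len(questions)
```

Sweeping `fraction` from 0.0 to 1.0 and comparing accuracy at full masking against 1/len(options) reproduces the logic of the ablation: a gap above chance at 100% masking indicates format exploitation rather than comprehension of the question.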
Shrutika Singh
Department of Neurosurgery, NYU Langone Health, New York, NY, USA
Anton Alyakin
Medical student at Washington University
D. Alber
Department of Neurosurgery, NYU Langone Health, New York, NY, USA
Jaden Stryker
Department of Neurosurgery, NYU Langone Health, New York, NY, USA
Ai Phuong S Tong
University of Washington School of Medicine, Seattle, Washington, USA
Karl L. Sangwon
Medical Student at NYU Grossman School of Medicine
Nicolas Goff
Department of Neurosurgery, NYU Langone Health, New York, NY, USA
Mathew de la Paz
Washington University in Saint Louis School of Medicine, Saint Louis, Missouri, USA
Miguel Hernandez-Rovira
Department of Neurosurgery, Washington University in Saint Louis, Saint Louis, Missouri, USA
Ki Yun Park
Department of Neurosurgery, Washington University in Saint Louis, Saint Louis, Missouri, USA
Eric C. Leuthardt
Department of Neurosurgery, Washington University in Saint Louis, Saint Louis, Missouri, USA
E. Oermann
Department of Neurosurgery, NYU Langone Health, New York, NY, USA