🤖 AI Summary
This work challenges the validity of multiple-choice question (MCQ) benchmarks for assessing the clinical reasoning capabilities of large language models (LLMs), arguing that the MCQ format itself may artificially inflate performance. To test this, we introduce FreeMedQA, a benchmark of free-response questions paired with MCQs, and systematically evaluate GPT-4o, GPT-3.5, and Llama-3-70B-Instruct. Across models, MCQ accuracy is on average 39.43 percentage points higher than accuracy on the paired free-response questions (p = 1.3 × 10⁻⁵). In a masking ablation that progressively hides the question stem, average MCQ performance at 100% masking remains 6.70 percentage points above random chance (p = 0.002), with GPT-4o attaining 37.34% accuracy, indicating substantial reliance on format and answer-option cues rather than semantic comprehension. These results suggest that current MCQ-based evaluations overestimate LLMs' true clinical reasoning proficiency and that free-response assessment provides a more rigorous and reliable alternative.
📝 Abstract
The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as evidence of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory, driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-Instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10⁻⁵), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70 percentage points greater than random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight how medical MCQ benchmarks can overestimate the capabilities of LLMs in medicine and, more broadly, point to the potential of LLM-evaluated free-response questions for improving both human and machine assessments.
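The masking study lends itself to a short sketch. Below is a minimal illustration, assuming MCQ items are stored as dicts with `stem`, `options`, and `answer` keys and that a caller supplies an `ask_model` function returning a predicted option index; both the data schema and `ask_model` are hypothetical stand-ins, not the paper's released code.

```python
# Minimal sketch of an iterative stem-masking ablation (assumed setup).
# `ask_model` is a hypothetical callable wrapping an LLM API.
import random

MASK_TOKEN = "[MASKED]"

def mask_stem(stem: str, fraction: float, rng: random.Random) -> str:
    """Replace a given fraction of the question-stem words with a mask token."""
    words = stem.split()
    n_mask = round(fraction * len(words))
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    return " ".join(MASK_TOKEN if i in masked_idx else w
                    for i, w in enumerate(words))

def mcq_accuracy_at_masking(questions, fraction: float, ask_model,
                            seed: int = 0) -> float:
    """Accuracy on MCQ items when `fraction` of each stem is masked.

    At fraction=1.0 only the answer options remain informative, so
    accuracy above chance suggests reliance on format cues rather
    than on the medical content of the question.
    """
    rng = random.Random(seed)
    correct = 0
    for q in questions:
        stem = mask_stem(q["stem"], fraction, rng)
        prompt = stem + "\n" + "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(q["options"])
        )
        predicted = ask_model(prompt)  # assumed to return an option index
        correct += int(predicted == q["answer"])
    return correct / len(questions)
```

Sweeping `fraction` from 0.0 to 1.0 and plotting accuracy against the masking level, with a horizontal line at the chance rate (e.g., 25% for four options), reproduces the shape of the analysis described above.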