Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit marked limitations in open-ended clinical reasoning, driven by training-data biases that produce cognitive rigidity (the Einstellung effect): overreliance on pattern matching, deficits in medical commonsense and flexible inference, and frequent hallucination coupled with unwarranted confidence. Method: The authors introduce M-ARC, a benchmark explicitly designed to assess clinical abstraction and reasoning, grounded in cognitive psychology principles to systematically elicit and quantify the Einstellung effect. M-ARC features expert physician annotations, uncertainty calibration analyses, and comparative evaluation across state-of-the-art models (e.g., o1, Gemini). Contribution/Results: Empirical analysis shows that LLMs significantly underperform physicians on M-ARC. By integrating the Einstellung effect into medical AI evaluation, the work offers an interpretable, reproducible, cognition-driven assessment framework, providing a benchmark and theoretical grounding for risk assessment in clinical LLM deployment.

📝 Abstract
Large Language Models (LLMs) have attained human-level accuracy on medical question-answer (QA) benchmarks. However, their limitations in navigating open-ended clinical scenarios have recently been shown, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the medical abstraction and reasoning corpus (M-ARC). M-ARC assesses clinical reasoning through scenarios designed to exploit the Einstellung effect -- the fixation of thought arising from prior experience, targeting LLM inductive biases toward inflexible pattern matching from their training data rather than engaging in flexible reasoning. We find that LLMs, including current state-of-the-art o1 and Gemini models, perform poorly compared to physicians on M-ARC, often demonstrating lack of commonsense medical reasoning and a propensity to hallucinate. In addition, uncertainty estimation analyses indicate that LLMs exhibit overconfidence in their answers, despite their limited accuracy. The failure modes revealed by M-ARC in LLM medical reasoning underscore the need to exercise caution when deploying these models in clinical settings.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with open-ended clinical scenarios.
M-ARC exposes LLM inflexibility in medical reasoning.
LLMs show overconfidence despite poor clinical accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

M-ARC assesses clinical reasoning
Targets LLM inductive biases
Reveals LLM overconfidence despite inaccuracy
Jonathan Kim
Department of Neurology and Neurologic Sciences, Stanford University, Palo Alto, CA
Anna Podlasek
Image Guided Therapy and Research Facility, University of Dundee, Dundee, UK
Kie Shidara
Weill Institute of Neurology and Neurosciences, University of California, San Francisco, San Francisco, CA
Feng Liu
Department of Systems and Enterprises, Stevens Institute of Technology, Hoboken, NJ
Ahmed Alaa
Department of EECS, University of California Berkeley, Berkeley, CA
Danilo Bernardo
University of California, San Francisco
Epilepsy · Pediatric Epilepsy