Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for evaluating the medical reasoning of large language models (LLMs) lack expert-level rigor, transparency, and scalability. Method: We introduce MedThink-Bench—the first benchmark comprising 500 high-difficulty medical questions, each annotated with expert-curated, fine-grained reasoning chains—and propose LLM-w-Ref, an evaluation framework that combines reasoning-chain supervision with an LLM-as-a-Judge mechanism to achieve both scalability and alignment with human experts. Contribution/Results: A comprehensive evaluation of 12 state-of-the-art models reveals that smaller open-source models—e.g., MedGemma-27B—can outperform larger proprietary models on medical reasoning tasks. LLM-w-Ref achieves 92.3% agreement with human experts, validating its effectiveness, interpretability, and practical utility for diagnostic reasoning assessment.

📝 Abstract
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.
Problem

Research questions and friction points this paper is trying to address.

Rigorously evaluate the medical reasoning of large language models
Close the scalability and fidelity gaps in current assessment strategies
Provide an expert-level benchmark for integration into clinical decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MedThink-Bench for medical reasoning evaluation
Proposes LLM-w-Ref with fine-grained rationale assessment
Leverages LLM-as-a-Judge for scalable expert-level fidelity
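The core idea behind LLM-w-Ref—grounding an LLM judge in expert-written reference rationales rather than asking it to grade free-form reasoning unaided—can be sketched as follows. This is an illustrative assumption, not the paper's actual implementation: the function names, prompt format, and scoring rule (fraction of reference steps the judge marks as covered) are hypothetical.

```python
# Hedged sketch of reference-guided LLM-as-a-Judge scoring in the spirit
# of LLM-w-Ref. The judge is passed in as a callable so any LLM backend
# (or a toy stand-in, as below) can be plugged in.
from typing import Callable, List


def judge_with_reference(model_answer: str,
                         reference_steps: List[str],
                         judge: Callable[[str], str]) -> float:
    """Ask the judge whether each expert reference step is covered by the
    model's answer; return the fraction of steps judged as covered."""
    if not reference_steps:
        return 0.0
    covered = 0
    for step in reference_steps:
        prompt = (
            "Reference reasoning step:\n"
            f"{step}\n\n"
            "Model answer:\n"
            f"{model_answer}\n\n"
            "Does the model answer correctly cover this step? Reply YES or NO."
        )
        if judge(prompt).strip().upper().startswith("YES"):
            covered += 1
    return covered / len(reference_steps)


def toy_judge(prompt: str) -> str:
    """Stand-in for an LLM call: says YES when the reference step's text
    appears verbatim (case-insensitively) in the model answer."""
    step = prompt.split("Reference reasoning step:\n")[1].split("\n\n")[0]
    answer = prompt.split("Model answer:\n")[1].split("\n\n")[0]
    return "YES" if step.lower() in answer.lower() else "NO"


score = judge_with_reference(
    "Fever and neck stiffness suggest meningitis; order a lumbar puncture.",
    ["fever and neck stiffness suggest meningitis",
     "order a lumbar puncture"],
    toy_judge,
)
print(score)  # 1.0 — both reference steps covered
```

Scoring against fine-grained reference steps, rather than a single holistic grade, is what lets a scalable LLM judge approximate expert-level fidelity: each step is a narrow, verifiable sub-question.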
Shuang Zhou
Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Wenya Xie
College of Science and Engineering, University of Minnesota, Minneapolis, MN, USA
Jiaxi Li
School of Computing, University of Georgia, Athens, GA, USA
Zaifu Zhan
PhD student at the University of Minnesota; MS from Tsinghua University
Natural Language Processing · Machine Learning · AI for Biomedicine · Large Language Models
Meijia Song
University of Minnesota
Nursing Informatics · Health Informatics
Han Yang
Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
Cheyenna Espinoza
Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Lindsay Welton
Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Xinnie Mai
School of Data Science, University of Virginia, Charlottesville, VA, USA
Yanwei Jin
Division of Biostatistics & Health Data Science, University of Minnesota, Minneapolis, MN, USA
Zidu Xu
School of Nursing, Columbia University, New York, New York, USA
Yuen-Hei Chung
Division of Cardiac Electrophysiology, University of California San Francisco, San Francisco, CA, USA
Yiyun Xing
School of Dentistry, University of Minnesota, Minneapolis, Minnesota, USA
Meng-Han Tsai
Division of Cardiothoracic Surgery, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
Emma Schaffer
Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Yucheng Shi
University of Georgia
Synthetic Data · Data-centric AI · Responsible AI · Explainability
Ninghao Liu
Assistant Professor, University of Georgia
Explainable AI · Fairness in Machine Learning · Graph Mining · Anomaly Detection
Zirui Liu
Peking University
Systems · Algorithms · Data Structures
Rui Zhang
Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA