Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for evaluating the medical reasoning of large language models (LLMs) lack expert-level rigor, transparency, and scalability. Method: We introduce MedThink-Bench—the first benchmark comprising 500 high-difficulty medical questions, each annotated with expert-curated, fine-grained reasoning chains—and propose LLM-w-Ref, an evaluation framework that combines reasoning-chain supervision with an LLM-as-a-Judge mechanism to achieve both scalability and alignment with human experts. Contribution/Results: A comprehensive evaluation of 12 state-of-the-art models reveals that smaller open-source models—e.g., MedGemma-27B—can outperform larger proprietary models on medical reasoning tasks. LLM-w-Ref achieves 92.3% agreement with human experts, validating its effectiveness, interpretability, and practical utility for diagnostic reasoning assessment.

📝 Abstract
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.
Problem

Research questions and friction points this paper is trying to address.

Rigorously evaluate the medical reasoning of large language models
Close the scalability and fidelity gaps in current assessment strategies
Provide an expert-level benchmark for integration into clinical decision-making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MedThink-Bench for medical reasoning evaluation
Proposes LLM-w-Ref with fine-grained rationale assessment
Leverages LLM-as-a-Judge for scalable expert-level fidelity
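The core idea behind LLM-w-Ref—grounding an LLM judge in expert-written reference rationales rather than asking it to grade free-form reasoning unaided—can be sketched as follows. This is an illustrative assumption, not the paper's actual implementation: the function names, prompt format, and scoring rule (fraction of reference steps the judge marks as covered) are hypothetical.

```python
# Hedged sketch of reference-guided LLM-as-a-Judge scoring in the spirit
# of LLM-w-Ref. The judge is passed in as a callable so any LLM backend
# (or a toy stand-in, as below) can be plugged in.
from typing import Callable, List


def judge_with_reference(model_answer: str,
                         reference_steps: List[str],
                         judge: Callable[[str], str]) -> float:
    """Ask the judge whether each expert reference step is covered by the
    model's answer; return the fraction of steps judged as covered."""
    if not reference_steps:
        return 0.0
    covered = 0
    for step in reference_steps:
        prompt = (
            "Reference reasoning step:\n"
            f"{step}\n\n"
            "Model answer:\n"
            f"{model_answer}\n\n"
            "Does the model answer correctly cover this step? Reply YES or NO."
        )
        if judge(prompt).strip().upper().startswith("YES"):
            covered += 1
    return covered / len(reference_steps)


def toy_judge(prompt: str) -> str:
    """Stand-in for an LLM call: says YES when the reference step's text
    appears verbatim (case-insensitively) in the model answer."""
    step = prompt.split("Reference reasoning step:\n")[1].split("\n\n")[0]
    answer = prompt.split("Model answer:\n")[1].split("\n\n")[0]
    return "YES" if step.lower() in answer.lower() else "NO"


score = judge_with_reference(
    "Fever and neck stiffness suggest meningitis; order a lumbar puncture.",
    ["fever and neck stiffness suggest meningitis",
     "order a lumbar puncture"],
    toy_judge,
)
print(score)  # 1.0 — both reference steps covered
```

Scoring against fine-grained reference steps, rather than a single holistic grade, is what lets a scalable LLM judge approximate expert-level fidelity: each step is a narrow, verifiable sub-question.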
Shuang Zhou
Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Wenya Xie
College of Science and Engineering, University of Minnesota, Minneapolis, MN, USA
Jiaxi Li
School of Computing, University of Georgia, Athens, GA, USA
Zaifu Zhan
PhD student at the University of Minnesota; MS from Tsinghua University
Natural Language Processing · Machine Learning · AI for Biomedicine · Large Language Models
Meijia Song
University of Minnesota
Nursing Informatics · Health Informatics
Han Yang
Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
Cheyenna Espinoza
Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Lindsay Welton
Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Xinnie Mai
School of Data Science, University of Virginia, Charlottesville, VA, USA
Yanwei Jin
Division of Biostatistics & Health Data Science, University of Minnesota, Minneapolis, MN, USA
Zidu Xu
School of Nursing, Columbia University, New York, New York, USA
Yuen-Hei Chung
Division of Cardiac Electrophysiology, University of California San Francisco, San Francisco, CA, USA
Yiyun Xing
School of Dentistry, University of Minnesota, Minneapolis, Minnesota, USA
Meng-Han Tsai
Division of Cardiothoracic Surgery, Department of Surgery, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
Emma Schaffer
Department of Surgery, University of Minnesota, Minneapolis, MN, USA
Yucheng Shi
University of Georgia
Synthetic Data · Data-centric AI · Responsible AI · Explainability
Ninghao Liu
Assistant Professor, University of Georgia
Explainable AI · Fairness in Machine Learning · Graph Mining · Anomaly Detection
Zirui Liu
Peking University
Systems · Algorithms · Data Structures
Rui Zhang
Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA