LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical large language model (LLM) evaluation benchmarks suffer from three key limitations: narrow question formats (predominantly multiple-choice), lack of real-world clinical relevance, and insufficient assessment of complex clinical reasoning. To address these, we introduce LLMEval-Med, a medical LLM benchmark grounded in authentic electronic health records and expert-designed clinical scenarios, comprising 2,996 multi-format questions across five core clinical domains. We propose a physician-validated, LLM-as-Judge automated evaluation pipeline integrating dynamically optimized clinical checklists, human-AI consistency analysis, and multi-granularity reasoning assessment. We systematically evaluate 13 state-of-the-art medical and general-purpose LLMs, and publicly release all data and code. LLMEval-Med substantially improves evaluation authenticity, comprehensiveness, and interpretability, establishing a clinically oriented standard for medical LLM assessment.

📝 Abstract
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks fall into three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released at https://github.com/llmeval/LLMEval-Med.
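The core of the described methodology is an LLM-as-Judge step that scores free-form answers against expert-developed checklists rather than exact-match answer keys. Below is a minimal Python sketch of what such a checklist-scoring step could look like; the names (`ChecklistItem`, `judge_response`, `call_llm`) and the yes/no grading prompt are hypothetical illustrations, not the paper's released code. The actual prompts and rubrics are in the linked repository.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    criterion: str   # expert-written criterion, e.g. "mentions contraindications"
    weight: float    # relative importance assigned by physicians

def judge_response(question: str, answer: str,
                   checklist: list[ChecklistItem], call_llm) -> float:
    """Score a model answer against an expert checklist with an LLM judge.

    `call_llm` is any function that takes a prompt string and returns the
    judge model's text completion (assumed here to reply "yes" or "no").
    """
    total, earned = 0.0, 0.0
    for item in checklist:
        prompt = (
            "You are a physician grading a medical answer.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"Does the answer satisfy this criterion: {item.criterion}? "
            "Reply with exactly 'yes' or 'no'."
        )
        verdict = call_llm(prompt).strip().lower()
        total += item.weight
        if verdict.startswith("yes"):
            earned += item.weight
    return earned / total if total else 0.0

# Example with a trivial stand-in judge that always answers "yes":
checklist = [ChecklistItem("states the correct diagnosis", 2.0),
             ChecklistItem("mentions key contraindications", 1.0)]
print(judge_response("...", "...", checklist, lambda p: "yes"))  # 1.0
```

Weighted checklist items are one plausible design; the key property is that each criterion yields an auditable per-item verdict, which is what makes physician validation of the judge feasible.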
Problem

Research questions and friction points this paper is trying to address.

Current medical benchmarks lack real-world clinical scenario data
Existing evaluations poorly assess complex reasoning in medical LLMs
Need reliable automated scoring validated by human-machine agreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world EHR-based medical benchmark
Automated evaluation with expert checklists
Dynamic refinement via human-machine agreement (see the sketch after this list)
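A minimal sketch of the human-machine agreement check that could drive this refinement loop: automated judge scores are compared with physician scores on the same questions, and any checklist whose judge scores correlate poorly with expert scores is flagged for revision. The use of Spearman rank correlation and the 0.8 threshold are illustrative assumptions, not values taken from the paper.

```python
from scipy.stats import spearmanr

def needs_refinement(human_scores: list[float],
                     judge_scores: list[float],
                     threshold: float = 0.8) -> bool:
    """Flag a checklist/prompt for expert revision when the LLM judge's
    scores diverge from physician scores (rank correlation below threshold)."""
    rho, _ = spearmanr(human_scores, judge_scores)
    return rho < threshold

# Example: strong agreement, so no refinement is triggered.
human = [0.9, 0.4, 0.7, 0.2, 1.0]
judge = [0.8, 0.5, 0.7, 0.1, 0.9]
print(needs_refinement(human, judge))  # False
```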
Ming Zhang
Computation and Artificial Intelligence Innovative College, Fudan University
Yujiong Shen
Computation and Artificial Intelligence Innovative College, Fudan University
Zelin Li
Northwestern University
Huayu Sha
Computation and Artificial Intelligence Innovative College, Fudan University
Binze Hu
Computation and Artificial Intelligence Innovative College, Fudan University
Yuhui Wang
Computation and Artificial Intelligence Innovative College, Fudan University
Chenhao Huang
School of Computer Science, University of Sydney
Distributed data management · Distributed systems
Shichun Liu
Fudan University
NLP
Jingqi Tong
Computation and Artificial Intelligence Innovative College, Fudan University
Changhao Jiang
Computation and Artificial Intelligence Innovative College, Fudan University
Mingxu Chai
Fudan University
Zhiheng Xi
Fudan University
LLM Reasoning · LLM-based Agents
Shihan Dou
Fudan University
LLMs · Code LMs · RL · Alignment
Tao Gui
Institute of Modern Languages and Linguistics, Fudan University
Qi Zhang
Computation and Artificial Intelligence Innovative College, Fudan University
Xuanjing Huang
Computation and Artificial Intelligence Innovative College, Fudan University