LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical large language model (LLM) evaluation benchmarks suffer from three key limitations: narrow question formats (predominantly multiple-choice), lack of real-world clinical relevance, and insufficient assessment of complex clinical reasoning. To address these, we introduce LLMEval-Med, a medical LLM benchmark grounded in authentic electronic health records and expert-designed clinical scenarios, comprising 2,996 multi-format questions across five core clinical domains. We propose a physician-validated, LLM-as-Judge automated evaluation pipeline integrating dynamically optimized clinical checklists, human-AI consistency analysis, and multi-granularity reasoning assessment. We systematically evaluate 13 state-of-the-art medical and general-purpose LLMs, and publicly release all data and code. LLMEval-Med substantially improves evaluation authenticity, comprehensiveness, and interpretability, establishing a clinically oriented standard for medical LLM assessment.

📝 Abstract
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks fall into three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released at https://github.com/llmeval/LLMEval-Med.
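The core of the described methodology is an LLM-as-Judge step that scores free-form answers against expert-developed checklists rather than exact-match answer keys. Below is a minimal Python sketch of what such a checklist-scoring step could look like; the names (`ChecklistItem`, `judge_response`, `call_llm`) and the yes/no grading prompt are hypothetical illustrations, not the paper's released code. The actual prompts and rubrics are in the linked repository.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    criterion: str   # expert-written criterion, e.g. "mentions contraindications"
    weight: float    # relative importance assigned by physicians

def judge_response(question: str, answer: str,
                   checklist: list[ChecklistItem], call_llm) -> float:
    """Score a model answer against an expert checklist with an LLM judge.

    `call_llm` is any function that takes a prompt string and returns the
    judge model's text completion (assumed here to reply "yes" or "no").
    """
    total, earned = 0.0, 0.0
    for item in checklist:
        prompt = (
            "You are a physician grading a medical answer.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            f"Does the answer satisfy this criterion: {item.criterion}? "
            "Reply with exactly 'yes' or 'no'."
        )
        verdict = call_llm(prompt).strip().lower()
        total += item.weight
        if verdict.startswith("yes"):
            earned += item.weight
    return earned / total if total else 0.0

# Example with a trivial stand-in judge that always answers "yes":
checklist = [ChecklistItem("states the correct diagnosis", 2.0),
             ChecklistItem("mentions key contraindications", 1.0)]
print(judge_response("...", "...", checklist, lambda p: "yes"))  # 1.0
```

Weighted checklist items are one plausible design; the key property is that each criterion yields an auditable per-item verdict, which is what makes physician validation of the judge feasible.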
Problem

Research questions and friction points this paper is trying to address.

Current medical benchmarks lack real-world clinical scenario data
Existing evaluations poorly assess complex reasoning in medical LLMs
Need reliable automated scoring validated by human-machine agreement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world EHR-based medical benchmark
Automated evaluation with expert checklists
Dynamic refinement via human-machine agreement (see the sketch after this list)
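A minimal sketch of the human-machine agreement check that could drive this refinement loop: automated judge scores are compared with physician scores on the same questions, and any checklist whose judge scores correlate poorly with expert scores is flagged for revision. The use of Spearman rank correlation and the 0.8 threshold are illustrative assumptions, not values taken from the paper.

```python
from scipy.stats import spearmanr

def needs_refinement(human_scores: list[float],
                     judge_scores: list[float],
                     threshold: float = 0.8) -> bool:
    """Flag a checklist/prompt for expert revision when the LLM judge's
    scores diverge from physician scores (rank correlation below threshold)."""
    rho, _ = spearmanr(human_scores, judge_scores)
    return rho < threshold

# Example: strong agreement, so no refinement is triggered.
human = [0.9, 0.4, 0.7, 0.2, 1.0]
judge = [0.8, 0.5, 0.7, 0.1, 0.9]
print(needs_refinement(human, judge))  # False
```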
Ming Zhang
Computation and Artificial Intelligence Innovative College, Fudan University
Yujiong Shen
Computation and Artificial Intelligence Innovative College, Fudan University
Zelin Li
Northwestern University
Huayu Sha
Computation and Artificial Intelligence Innovative College, Fudan University
Binze Hu
Computation and Artificial Intelligence Innovative College, Fudan University
Yuhui Wang
Computation and Artificial Intelligence Innovative College, Fudan University
Chenhao Huang
School of Computer Science, University of Sydney
Distributed data management · Distributed systems
Shichun Liu
Fudan University
NLP
Jingqi Tong
Computation and Artificial Intelligence Innovative College, Fudan University
Changhao Jiang
Computation and Artificial Intelligence Innovative College, Fudan University
Mingxu Chai
Fudan University
Zhiheng Xi
Fudan University
LLM Reasoning · LLM-based Agents
Shihan Dou
Fudan University
LLMs · Code LMs · RL · Alignment
Tao Gui
Institute of Modern Languages and Linguistics, Fudan University
Qi Zhang
Computation and Artificial Intelligence Innovative College, Fudan University
Xuanjing Huang
Computation and Artificial Intelligence Innovative College, Fudan University