Large Language Model Benchmarks in Medical Tasks

📅 2024-10-28
🏛️ arXiv.org
📈 Citations: 15
Influential: 0
🤖 AI Summary
Medical large language models (LLMs) lack systematic, cross-modal benchmarking frameworks tailored to clinical tasks. Method: We conduct the first comprehensive survey of clinical-task-oriented medical LLM benchmark datasets, integrating major text-, image-, and multimodal resources—including MIMIC-III/IV, BioASQ, PubMedQA, and CheXpert—via literature review, meta-analysis, and task mapping. These cover core clinical scenarios: electronic health records, doctor–patient dialogues, medical question answering, and radiology report generation. We propose a task-driven unified taxonomy, identifying critical gaps such as limited linguistic diversity and absence of structured omics data, and advocate synthetic data augmentation and multi-source fusion as novel evaluation paradigms. Contribution: Our work establishes a reusable, clinically grounded LLM evaluation framework that supports optimization of report generation, clinical summarization, and predictive decision-making, thereby advancing robust, multimodal medical AI.

📝 Abstract
With the increasing application of large language models (LLMs) in the medical domain, evaluating these models' performance using benchmark datasets has become crucial. This paper presents a comprehensive survey of various benchmark datasets employed in medical LLM tasks. These datasets span multiple modalities including text, image, and multimodal benchmarks, focusing on different aspects of medical knowledge such as electronic health records (EHRs), doctor-patient dialogues, medical question-answering, and medical image captioning. The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs for clinical tasks such as diagnosis, report generation, and predictive decision support. Key benchmarks include MIMIC-III, MIMIC-IV, BioASQ, PubMedQA, and CheXpert, which have facilitated advancements in tasks like medical report generation, clinical summarization, and synthetic data generation. The paper summarizes the challenges and opportunities in leveraging these benchmarks for advancing multimodal medical intelligence, emphasizing the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis. This work also provides a foundation for future research in the application of LLMs in medicine, contributing to the evolving field of medical artificial intelligence.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance using medical benchmark datasets
Surveying multimodal benchmarks for clinical diagnosis and reporting
Addressing challenges in medical AI through diverse dataset integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Surveying benchmark datasets for medical LLM evaluation
Categorizing multimodal medical data across clinical tasks
Identifying needs for diverse language and omics data
Lawrence K.Q. Yan
Hong Kong University of Science and Technology
Qian Niu
UT Austin
Condensed matter physics
Ming Li
Georgia Institute of Technology
Yichao Zhang
The University of Texas at Dallas
Caitlyn Heqi Yin
University of Wisconsin-Madison
Cheng Fei
Cornell University
Benji Peng
Principal Investigator at AppCubic
Machine Learning · Biophysics
Ziqian Bi
Indiana University
Pohsun Feng
National Taiwan Normal University
Keyu Chen
Georgia Institute of Technology
Junyu Liu
Kyoto University