Does Biomedical Training Lead to Better Medical Performance?

📅 2024-04-05
📈 Citations: 7
Influential: 0
🤖 AI Summary
Despite widespread adoption of domain-specific fine-tuning for biomedical large language models (LLMs), there remains a lack of systematic empirical evidence on its actual impact on clinical capabilities. Method: We conduct a comprehensive evaluation of 25 state-of-the-art LLMs—including both general-purpose and biomedical-fine-tuned variants—across six standardized clinical tasks, using CLUE, a reproducible, open-source medical evaluation framework (with all code and data publicly released). Contribution/Results: Our study is the first to empirically demonstrate that most biomedical-fine-tuned models underperform general-purpose models in critical clinical competencies—including hallucination suppression, ICD-10 coding accuracy, and instruction following. Notably, Llama-3.1-70B-Instruct surpasses specialized biomedical models across multiple tasks, revealing inherent trade-offs in domain adaptation. These findings challenge the prevailing assumption that biomedical fine-tuning inherently enhances clinical performance, establishing a rigorous empirical benchmark and offering methodological insights for LLM deployment in healthcare.

📝 Abstract
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training in the context of six practical medical tasks evaluating 25 models. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.
Problem

Research questions and friction points this paper is trying to address.

Evaluating biomedical LLM training impact on medical task performance
Assessing performance decline in biomedical models after fine-tuning
Comparing general-domain versus biomedical models on medical applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated biomedical LLMs on six medical tasks
Found performance decline after domain-specific fine-tuning
General models outperformed biomedical counterparts on tasks
Amin Dada
Institute for AI in Medicine (IKIM), University Hospital Essen
Marie Bauer
Software Developer, Institute for AI in Medicine, Essen
NLP, ML, Computational Linguistics, Linguistics
Amanda Butler Contreras
NVIDIA, Santa Clara, CA, USA
Osman Alperen Koraş
Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, Germany
Constantin Seibold
University Heidelberg
Computer Vision, Machine Learning, Medical Image Analysis
Kaleb E. Smith
NVIDIA
Machine Learning, Generative Models, Deep Learning, Computer Vision, Time Series Analysis
J. Kleesiek
Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, Germany; Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen University Hospital Essen (AöR), Essen, Germany; German Cancer Consortium (DKTK, Partner site Essen), Heidelberg, Germany; Department of Physics, TU Dortmund, Dortmund, Germany