Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses knowledge drift and internal inconsistency in large language models (LLMs) arising from dynamically evolving clinical guidelines. We introduce DriftMedQA, the first temporal reliability evaluation framework for medicine, comprising 4,290 clinically grounded scenarios. The benchmark systematically exposes critical deficiencies in mainstream LLMs, including failure to reject outdated recommendations and susceptibility to endorsing contradictory guidelines. To mitigate these issues, we propose a synergistic strategy integrating retrieval-augmented generation (RAG) with direct preference optimization (DPO): RAG injects temporally accurate clinical knowledge, while DPO fine-tunes model preferences to better recognize and reject outdated or conflicting content. Experimental results show that this approach improves the outdated-recommendation rejection rate by 37% and achieves 89.2% consistency in resolving guideline conflicts, significantly outperforming either method alone and substantially enhancing the temporal coherence and clinical trustworthiness of LLMs in rapidly updating medical domains.

📝 Abstract
Large Language Models (LLMs) hold great promise for health care, yet they struggle to adapt to rapidly evolving medical knowledge, which can lead to outdated or contradictory treatment suggestions. This study investigated how LLMs respond to evolving clinical guidelines, focusing on concept drift and internal inconsistencies. We developed the DriftMedQA benchmark to simulate guideline evolution and assessed the temporal reliability of various LLMs. Our evaluation of seven state-of-the-art models across 4,290 scenarios revealed difficulties in rejecting outdated recommendations and a frequent tendency to endorse conflicting guidance. Additionally, we explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization. While each method improved model performance, their combination led to the most consistent and reliable results. These findings underscore the need to improve LLM robustness to temporal shifts to ensure more dependable applications in clinical practice.
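The page does not reproduce the benchmark's data format, but a DriftMedQA-style temporal-reliability check can be sketched as follows; the `DriftScenario` fields and the scoring rule are illustrative assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass

@dataclass
class DriftScenario:
    """One guideline-evolution test case (hypothetical schema)."""
    question: str
    outdated_answer: str   # recommendation from a superseded guideline
    current_answer: str    # recommendation from the current guideline

def temporal_reliability(predictions: list[str],
                         scenarios: list[DriftScenario]) -> float:
    """Fraction of scenarios where the model selects the current guideline's
    recommendation (and thereby rejects the outdated one)."""
    hits = sum(pred == s.current_answer
               for pred, s in zip(predictions, scenarios))
    return hits / len(scenarios)
```

Under this metric, a model that always echoes the older guideline scores 0.0, which is the failure mode the abstract describes.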
Problem

Research questions and friction points this paper is trying to address.

Assessing medical knowledge drift in LLMs
Mitigating conflicting treatment suggestions
Improving temporal reliability in clinical guidelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed DriftMedQA benchmark for guideline evolution
Used Retrieval-Augmented Generation to enhance accuracy
Applied Direct Preference Optimization for fine-tuning
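The DPO component above can be sketched as the standard DPO objective applied to temporal preference pairs, where the chosen response follows the current guideline and the rejected response follows the outdated one. The log-probability values and the `beta` default below are illustrative assumptions, not details taken from the paper.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: negative log-sigmoid of the
    beta-scaled log-ratio margin between the policy and a frozen reference."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Chosen = answer matching the current guideline; rejected = outdated answer.
# Before the policy moves away from the reference, the loss is log(2) ~ 0.693.
```

Minimizing this loss pushes the policy to assign relatively more probability to current-guideline answers than to outdated ones, which is the "reject outdated content" behavior the paper targets.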
Weiyi Wu
Dartmouth College
computer vision, LLM, multimodal model

Xinwen Xu
Medical Practice Evaluation Center, Massachusetts General Hospital, Boston, MA

Chongyang Gao
Northwestern University
Natural Language Processing, Computer Vision

Xingjian Diao
Dartmouth College
MLLM, Video Understanding, Speech Understanding

Siting Li
Department of Biomedical Data Science, Dartmouth College, Hanover, NH

Lucas A. Salas
Department of Epidemiology, Dartmouth College, Hanover, NH

Jiang Gui
Dartmouth College
Biostatistics, Machine Learning