Medical large language models are easily distracted

📅 2025-04-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Medical large language models (LLMs) suffer significant performance degradation in real-world clinical settings due to interference from irrelevant information—such as ambiguous terms used in non-clinical senses or incidental mentions of unrelated diseases. To address this gap, we introduce MedDistractQA, the first benchmark explicitly designed to evaluate clinical information filtering capability. It employs USMLE-style questions and systematically injects diverse clinical distractors. Our evaluation reveals that existing models lack intrinsic logical mechanisms to distinguish clinically relevant from irrelevant information. Experiments show distractors reduce accuracy by up to 17.9%. Notably, retrieval-augmented generation (RAG) and domain-specific fine-tuning fail to improve robustness; in some configurations, they exacerbate performance degradation. This work establishes a novel benchmark for assessing medical LLM robustness, uncovers a critical limitation in current clinical reasoning architectures, and opens new research directions toward building more reliable, context-aware medical AI systems.

📝 Abstract
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies like ambient dictation, which automatically generates draft notes from live patient encounters, has the potential to introduce additional noise, making it crucial to assess the ability of LLMs to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context, or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions to improve model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not change this effect and in some cases introduced their own confounders and further degraded performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to filter clinical data from noise
Evaluating impact of distractions on medical LLM accuracy
Identifying LLMs' lack of native logical mechanisms for separating relevant from irrelevant clinical information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed the MedDistractQA benchmark of USMLE-style questions with injected clinical distractors
Tested RAG and fine-tuning solutions
Highlighted need for robust mitigation strategies
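The evaluation design described above can be sketched in a few lines: append an irrelevant statement to a question stem, then compare accuracy with and without the distractor. This is a minimal illustrative sketch, not the paper's actual pipeline; the function names, the sample distractor sentences, and the helper `accuracy_drop` are all assumptions for illustration.

```python
import random

# Illustrative distractors of the two kinds the paper describes
# (hypothetical examples, not drawn from the MedDistractQA dataset):
# a polysemous word used non-clinically, and an unrelated condition.
DISTRACTORS = [
    "The patient's dog had a positive attitude during the visit.",
    "The patient's coworker was recently diagnosed with gout.",
]

def inject_distractor(stem: str, options: str, rng: random.Random) -> str:
    """Append one irrelevant sentence to the question stem,
    leaving the answer options untouched."""
    distractor = rng.choice(DISTRACTORS)
    return f"{stem} {distractor}\n{options}"

def accuracy_drop(clean_correct: int, distracted_correct: int, n: int) -> float:
    """Percentage-point drop in accuracy attributable to distractors."""
    return 100.0 * (clean_correct - distracted_correct) / n
```

For example, if a model answered 85 of 100 clean questions correctly but only 67 of the distracted versions, `accuracy_drop(85, 67, 100)` gives an 18.0-point drop, on the order of the up-to-17.9% degradation the paper reports.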
👥 Authors
Krithik Vishwanath
Department of Aerospace Engineering and Engineering Mechanics, Department of Mathematics, The University of Texas at Austin, Austin, Texas, 78712
Anton Alyakin
Medical student, Washington University
D. Alber
Department of Neurological Surgery, NYU Langone Medical Center, New York, New York, 10016
Jin Vivian Lee
Department of Neurosurgery, Washington University School of Medicine in St. Louis, St. Louis, Missouri, 63110
Douglas Kondziolka
Department of Neurological Surgery, NYU Langone Medical Center, New York, New York, 10016
Eric Karl Oermann
New York University