🤖 AI Summary
Medical large language models (LLMs) suffer significant performance degradation in real-world clinical settings due to interference from irrelevant information, such as ambiguous terms used in non-clinical senses or incidental mentions of unrelated diseases. To address this gap, we introduce MedDistractQA, a benchmark explicitly designed to evaluate clinical information filtering, built from USMLE-style questions with systematically injected clinical distractors. Our evaluation reveals that existing models lack intrinsic logical mechanisms for distinguishing clinically relevant from irrelevant information: distractors reduce accuracy by up to 17.9%. Notably, retrieval-augmented generation (RAG) and domain-specific fine-tuning fail to improve robustness, and in some configurations they exacerbate the degradation. This work establishes a benchmark for assessing medical LLM robustness, uncovers a critical limitation in current clinical reasoning, and opens new research directions toward more reliable, context-aware medical AI systems.
📝 Abstract
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies such as ambient dictation, which automatically generates draft notes from live patient encounters, may introduce additional noise, making it crucial to assess the ability of LLMs to filter out irrelevant information. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context, or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions for improving model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not mitigate this effect and in some cases introduced their own confounders, further degrading performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.
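To make the benchmark construction concrete, the distractor-injection idea can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the function names, the toy question, and the simulated accuracy numbers are all assumptions for demonstration purposes.

```python
# Hypothetical sketch of a MedDistractQA-style evaluation: embed an irrelevant
# sentence into a USMLE-style question stem, then compare model accuracy on
# clean vs. distracted questions. All names and data here are illustrative.

def inject_distractor(question: str, distractor: str) -> str:
    """Append a clinically irrelevant sentence to the question stem,
    leaving the answer choices untouched."""
    stem, _, choices = question.partition("\n(A)")
    return f"{stem.rstrip()} {distractor}\n(A){choices}"

def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results)

# Toy example: a polysemous distractor ("Murmur" used in a non-clinical sense).
question = (
    "A 45-year-old man presents with chest pain radiating to the left arm. "
    "What is the most likely diagnosis?\n(A) Myocardial infarction\n(B) GERD"
)
distractor = "His daughter recently adopted a cat named Murmur."
distracted = inject_distractor(question, distractor)

# Simulated per-question correctness for some model (fabricated for the demo).
baseline = [True, True, True, True, False]     # 80% on clean questions
with_noise = [True, True, False, True, False]  # 60% on distracted questions
drop = accuracy(baseline) - accuracy(with_noise)
print(f"Accuracy drop: {drop:.1%}")  # → Accuracy drop: 20.0%
```

In the real benchmark the correctness lists would come from querying an LLM on each question variant; the point of the sketch is only that the distractor changes the stem's surface text while leaving the clinically relevant facts, and the correct answer, unchanged.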