Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a safety risk in clinical large language models (LLMs): they can yield inconsistent diagnoses when given semantically equivalent but differently worded inputs. To evaluate diagnostic stability systematically, the authors propose a semantic verification framework grounded in natural language inference (NLI) that filters prompt variants for meaning preservation, refines them with an LLM-as-a-judge, and has them audited by a clinical expert. Three metrics quantify model sensitivity to meaning-preserving rephrasings: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). Experiments on 16 open-source general-purpose and medical LLMs show that the effect of domain specialization on robustness is mixed and model-dependent: several domain-specific models rank among the most robust, while strong general-purpose baselines remain competitive.

📝 Abstract

Large Language Models (LLMs) are increasingly used in clinical applications, yet their behavior remains highly sensitive to subtle linguistic variations such as rephrasing or syntactic changes. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. A key challenge is ensuring that prompt variations truly preserve clinical meaning, since embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) and GP models are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust when compared with their GP counterparts, while strong GP baselines remain competitive.
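
The NLI filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a bidirectional entailment check (the abstract does not say whether entailment is tested in one or both directions), an arbitrary threshold of 0.9, and a hypothetical scorer interface `NliScorer`. The `toy_nli` function is a stand-in for a real NLI model, included only so the example runs on its own.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type (an assumption, not from the paper): maps a
# (premise, hypothesis) pair to label probabilities, e.g.
# {"entailment": 0.97, "neutral": 0.02, "contradiction": 0.01}.
NliScorer = Callable[[str, str], Dict[str, float]]


def is_meaning_preserving(original: str, variant: str, nli: NliScorer,
                          threshold: float = 0.9) -> bool:
    """Accept a variant only if entailment holds in both directions.

    Bidirectional entailment and the 0.9 threshold are assumptions made
    for this sketch.
    """
    forward = nli(original, variant).get("entailment", 0.0)
    backward = nli(variant, original).get("entailment", 0.0)
    return min(forward, backward) >= threshold


def filter_variants(original: str, variants: List[str], nli: NliScorer,
                    threshold: float = 0.9) -> List[str]:
    """Keep only the variants that pass the entailment check."""
    return [v for v in variants
            if is_meaning_preserving(original, v, nli, threshold)]


def toy_nli(premise: str, hypothesis: str) -> Dict[str, float]:
    """Stand-in scorer for demonstration only.

    It flags a negation mismatch as contradiction and treats everything
    else as entailment. A real pipeline would call an NLI model here.
    """
    def negated(text: str) -> bool:
        lowered = f" {text.lower()} "
        return " no " in lowered or "denies" in lowered

    if negated(premise) != negated(hypothesis):
        return {"entailment": 0.02, "neutral": 0.08, "contradiction": 0.90}
    return {"entailment": 0.95, "neutral": 0.04, "contradiction": 0.01}


original = "Patient reports chest pain for two days."
variants = [
    "The patient has had chest pain for two days.",
    "Patient denies chest pain for two days.",
]
kept = filter_variants(original, variants, toy_nli)
```

In this toy run, the negated variant is rejected even though it shares nearly all of its words with the original, which is the kind of distinction the abstract says embedding-based similarity often misses. In the paper's framework, variants that survive this stage are further refined by an LLM-as-a-judge and audited by a clinical expert.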

Problem

Research questions and friction points this paper is trying to address.

semantic stability

clinical LLMs

prompt variation

meaning preservation

diagnostic consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Stability

Natural Language Inference

Clinical LLMs