Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models

📅 2026-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical reliability issue in medical vision-language models (VLMs): consistency under prompt rewriting, a common trustworthiness proxy, can mask hazardous behavior in which a model ignores the visual input and relies solely on textual priors. To tackle this, the study introduces the first sample-level, four-quadrant safety taxonomy that jointly evaluates consistency and image reliance, thereby identifying "consistent yet dangerous" predictions. Through prompt-rewriting tests, image-ablation controls, LoRA fine-tuning, and entropy–accuracy analyses on MIMIC-CXR and PadChest, the authors find that while LoRA fine-tuning sharply reduces prediction flipping, it shifts most samples into the Dangerous quadrant (up to 98.5% for LLaVA-Rad Base on PadChest). Moreover, flip rate is strongly negatively correlated with the fraction of Dangerous samples (r = −0.89), underscoring the necessity of pairing consistency checks with a text-only baseline in model evaluation.

📝 Abstract
Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
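The four-quadrant taxonomy described in the abstract can be sketched as a simple per-sample rule. The sketch below is an illustration based on the definitions given there, not the authors' implementation: the function name, argument names, and the way the two flags are derived (majority-free exact agreement across paraphrases, and a with-image vs. text-only comparison) are assumptions for clarity.

```python
def classify_sample(paraphrase_preds, with_image_pred, text_only_pred):
    """Assign one sample to a quadrant of the safety taxonomy.

    paraphrase_preds: predictions for semantically equivalent prompts
                      (image present for all of them).
    with_image_pred:  prediction for the canonical prompt with the image.
    text_only_pred:   prediction for the same prompt with the image removed.
    """
    # Consistent: every paraphrased prompt yields the same prediction.
    consistent = len(set(paraphrase_preds)) == 1
    # Image-reliant: removing the image changes the prediction,
    # i.e. the model is actually using the visual input.
    image_reliant = text_only_pred != with_image_pred

    if consistent and image_reliant:
        return "Ideal"
    if image_reliant:
        return "Fragile"      # inconsistent but image-reliant
    if consistent:
        return "Dangerous"    # stable, yet ignores the image
    return "Worst"            # inconsistent and ignores the image
```

The Dangerous quadrant is the paper's key case: the consistency check alone passes, and only the one extra text-only forward pass exposes that the image is being ignored.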
Problem

Research questions and friction points this paper is trying to address.

medical vision-language models
consistency under paraphrase
image reliance
false reliability
safety classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

safety taxonomy
consistency under paraphrase
image reliance
medical vision-language models
false reliability