PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models

📅 2026-02-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the inconsistent behavior of medical vision-language models (VLMs) on clinically equivalent questions phrased differently, a critical deployment risk. The authors present the first systematic quantification of paraphrase sensitivity in this domain, introducing PSF-Med, a benchmark of 19,748 chest X-ray questions paired with roughly 92,000 paraphrases. Combining sparse autoencoders, causal intervention, and feature clamping, they show that a low answer flip rate does not guarantee visual grounding: consistency can stem from linguistic priors rather than visual understanding. The analysis identifies a sparse feature in layer 17 associated with question framing; clamping this feature at inference reduces flip rates by 31% (relative) at only a 1.3 percentage-point accuracy cost, while also reducing reliance on language priors.
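The flip-rate metric the summary refers to can be made concrete with a small sketch. This is an illustrative reimplementation, not the paper's code: it treats a "flip" as any pair of meaning-preserving paraphrases of the same (image, question) group that receive different yes/no answers, and reports the fraction of such pairs.

```python
# Hypothetical sketch of a paraphrase flip-rate metric. The pairwise
# definition and the example data are illustrative assumptions, not
# taken from the PSF-Med paper itself.
from itertools import combinations

def flip_rate(groups):
    """groups: list of answer lists, one per (image, question) group of
    meaning-preserving paraphrases. Returns the fraction of paraphrase
    pairs within a group whose yes/no answers disagree."""
    flips = total = 0
    for answers in groups:
        for a, b in combinations(answers, 2):
            total += 1
            flips += (a != b)
    return flips / total if total else 0.0

# Two paraphrase groups: the first is inconsistent (2 of 3 pairs flip),
# the second is perfectly consistent.
rate = flip_rate([["yes", "yes", "no"], ["no", "no", "no"]])  # 2/6 ≈ 0.33
```

A pairwise definition is one natural choice; the paper may instead count any disagreement with the original question's answer, which would give different absolute numbers.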

๐Ÿ“ Abstract
Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest X-ray questions paired with about 92,000 meaning-preserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, a low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yes-minus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.
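The feature-clamping intervention described above can be sketched in a few lines. This is a minimal NumPy illustration of the general SAE-clamping technique, assuming a ReLU SAE with random placeholder weights; the feature index, dimensions, and weights are hypothetical, not the paper's actual layer-17 SAE.

```python
# Minimal sketch of SAE feature clamping on a hidden state. All weights
# and the feature index are illustrative placeholders (assumptions),
# not MedGemma/GemmaScope parameters.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))  # SAE encoder weights
W_dec = rng.normal(size=(d_sae, d_model))  # SAE decoder weights

def clamp_feature(h, idx, value):
    """Replace feature `idx`'s contribution to hidden state `h` with a
    fixed activation `value`, leaving all other features untouched."""
    acts = np.maximum(h @ W_enc, 0.0)           # ReLU SAE feature activations
    delta = (value - acts[idx]) * W_dec[idx]    # change in decoder output
    return h + delta                            # edited residual-stream state

h = rng.normal(size=d_model)
h_edited = clamp_feature(h, idx=5, value=0.0)   # ablate the framing feature
```

Clamping `value` to 0 ablates the feature (as in the paper's causal patching); clamping it to a fixed nonzero level pins the feature's contribution regardless of prompt framing. In a real model this edit would be applied via a forward hook at the chosen layer.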
Problem

Research questions and friction points this paper is trying to address.

Paraphrase Sensitivity
Medical Vision Language Models
Robustness Evaluation
Language Priors
Answer Consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Paraphrase Sensitivity
Medical Vision Language Models
Sparse Autoencoders
Causal Patching
Language Priors