🤖 AI Summary
This study addresses a vulnerability in medical vision-language models (VLMs) answering chest X-ray (CXR) questions: negation-induced polarity reversal, in which a model selects a negated option such as "No consolidation" even though the image shows the finding, producing a clinical statement that contradicts the visual evidence. The work names this failure "negated-option attraction" and introduces CXR-ContraBench, a diagnostic benchmark built from internal ReXVQA slices and external OpenI and CheXpert protocols, to measure it. The benchmark centers on present-finding questions and uses absent-finding questions as secondary tests. Chain-of-thought prompting reduces some presence-side reversals but does not eliminate them and can amplify absence-side contradictions. To repair the failure without retraining, the authors propose QCCV-Neg (Question-Conditioned Consistency Verifier for Negation), a deterministic check applied at inference time. On a strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; with QCCV-Neg, they reach 96.60% and 95.32%.
📝 Abstract
When a chest X-ray shows consolidation and the question asks which finding is present, a medical vision-language model may answer "No consolidation." This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting the image. We study this failure as negated-option attraction, where a model is drawn to a negated answer option even when it conflicts with both the visual evidence and the question. We introduce CXR-ContraBench (Chest X-Ray Contradiction Benchmark), a diagnostic benchmark spanning internal ReXVQA slices and external OpenI and CheXpert protocols. The benchmark centers on present-finding questions, where selecting "No X" despite visible X creates the main clinical risk, and uses absent-finding questions as secondary tests of whether models copy negated wording. Across CheXpert protocols, the failure is substantial and persistent. On a strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; on a matched 135,754-record CheXpert training-split protocol, both models select negated options on over 62% of presence questions. Chain-of-thought prompting reduces some presence-side reversals but does not eliminate them and can amplify absence-side contradictions. Finally, QCCV-Neg (Question-Conditioned Consistency Verifier for Negation) deterministically repairs the measured polarity-confused subset without retraining, raising MedGemma and Qwen2.5-VL to 96.60% and 95.32% accuracy on the direct presence probe. These results show that standard accuracy can hide a clinically meaningful inference-time polarity failure. Source code and benchmark construction scripts are available at https://github.com/fangzr/cxr-contrabench-code.
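
To make the idea of a question-conditioned polarity check concrete, the following is a minimal sketch and not the authors' QCCV-Neg implementation. The abstract does not specify the verifier's rules, so everything here is an assumption for illustration: the negation cues, the presence-question heuristic, and the function names (`is_negated`, `asks_presence`, `negated_selection_rate`, `verify`) are all hypothetical. The sketch measures how often a negated option is chosen on presence questions and swaps a negated choice for its affirmative counterpart when one exists among the options.

```python
import re

# Hypothetical cue lists: the abstract does not give the actual rules.
_NEGATION = re.compile(r"^\s*(no|without|absence of)\s+", re.IGNORECASE)
_PRESENCE_Q = re.compile(r"\b(present|visible|shown|seen)\b", re.IGNORECASE)
_NEGATED_Q = re.compile(r"\b(not|absent|absence)\b", re.IGNORECASE)


def is_negated(option: str) -> bool:
    """True if the option starts with a negation cue such as 'No '."""
    return _NEGATION.match(option) is not None


def strip_negation(option: str) -> str:
    """Return the finding named by an option, lowercased, without the cue."""
    return _NEGATION.sub("", option).strip().lower()


def asks_presence(question: str) -> bool:
    """Toy heuristic: the question asks which finding is present."""
    if _NEGATED_Q.search(question):
        return False
    return _PRESENCE_Q.search(question) is not None


def negated_selection_rate(records) -> float:
    """Fraction of presence questions on which the chosen option is negated."""
    presence = [r for r in records if asks_presence(r["question"])]
    if not presence:
        return 0.0
    return sum(is_negated(r["chosen"]) for r in presence) / len(presence)


def verify(question: str, options: list[str], chosen: str) -> str:
    """Swap a negated choice on a presence question for its affirmative twin."""
    if not (asks_presence(question) and is_negated(chosen)):
        return chosen
    finding = strip_negation(chosen)
    for opt in options:
        if not is_negated(opt) and opt.strip().lower() == finding:
            return opt
    return chosen  # no affirmative counterpart: leave the answer unchanged


question = "Which finding is present in this chest X-ray?"
options = ["Consolidation", "No consolidation", "No pleural effusion"]
repaired = verify(question, options, "No consolidation")
```

In this toy example `repaired` is `"Consolidation"`. The rule is deterministic and needs no retraining, which matches the abstract's description at a high level, but a text-only rule like this cannot decide whether the finding is actually visible; the benchmark scripts in the linked repository are the reference for how the real verifier works.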