Cross-modal linkage risk in clinical vision-language models

📅 2026-06-01

🤖 AI Summary

This study addresses a cross-modal re-identification risk in clinical vision-language models: when radiographs and reports are deliberately kept separate after acquisition, a de-identified medical image may still be linked back to its original textual report through the shared embedding space. The work quantifies this risk systematically by formulating it as an image-to-report retrieval task and evaluating re-linkage across models of increasing clinical specialization on large-scale chest X-ray datasets. To mitigate the vulnerability, the authors freeze both encoders and apply differentially private fine-tuning only to the projection heads that define the alignment layer (ε = 0.34, δ = 6×10⁻⁶), avoiding costly retraining of the backbone. The strongest model retrieves the correct report at 50 times chance with a candidate pool of N = 10,000; after privacy-preserving optimization, Recall@1 at N = 10,000 on MIMIC-CXR drops by 61.8%, while macro AUROC for linear-probe classification across 14 labels declines by only 0.2 percentage points (79.63% to 79.43%). The reduction transfers to CheXpert Plus without retraining, balancing privacy preservation against representational utility.
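The retrieval audit described above can be illustrated with a minimal sketch. It assumes that re-linkage is scored as Recall@K under cosine similarity and that chance is K/N for a pool of N candidate reports. The function names (`recall_at_k`, `chance_multiple`) and the synthetic embeddings are hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def recall_at_k(image_emb, report_emb, k=1):
    """Fraction of images whose true report ranks in the top k.

    Assumes row i of image_emb is paired with row i of report_emb.
    """
    sims = l2_normalize(image_emb) @ l2_normalize(report_emb).T
    true_sim = np.diag(sims)
    # Rank = number of candidate reports scoring strictly above the true one.
    rank = (sims > true_sim[:, None]).sum(axis=1)
    return float((rank < k).mean())

def chance_multiple(recall, n_candidates, k=1):
    # Random guessing places the true report in the top k with probability k/N.
    return recall / (k / n_candidates)

# Synthetic stand-in for a candidate pool of N = 100 paired embeddings.
rng = np.random.default_rng(0)
n, d = 100, 32
report_emb = rng.normal(size=(n, d))
image_emb = report_emb + 0.5 * rng.normal(size=(n, d))  # noisy "aligned" images

r1 = recall_at_k(image_emb, report_emb, k=1)
print(f"Recall@1 = {r1:.3f}, multiple of chance = {chance_multiple(r1, n):.1f}")
```

Under these assumptions, "50 times chance at N = 10,000" would correspond to an absolute Recall@1 of 50 × 1/10,000 = 0.5%, which shows why the result is reported as a multiple of chance rather than as a raw rate.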

📝 Abstract

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (ε = 0.34, δ = 6×10⁻⁶). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP fine-tuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
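The idea of applying differentially private optimization only to a projection head, with the encoders frozen, can be sketched as a single DP-SGD update (per-example gradient clipping followed by Gaussian noise). This is a minimal illustration and not the authors' implementation: the squared-error alignment loss is a stand-in for whatever objective the paper uses, the names `per_example_grads` and `dp_sgd_step` and all hyperparameter values are hypothetical, and the sketch does not perform the privacy accounting needed to arrive at a specific (ε, δ).

```python
import numpy as np

def per_example_grads(W, x, t):
    """Per-example gradients of 0.5 * ||W @ x_i - t_i||^2 with respect to W.

    x: frozen image-encoder features, shape (B, d_in)
    t: fixed target embeddings, shape (B, d_out)
    Returns an array of shape (B, d_out, d_in).
    """
    resid = x @ W.T - t                       # (B, d_out)
    return resid[:, :, None] * x[:, None, :]  # outer product per example

def dp_sgd_step(W, grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on the projection head W only."""
    B = grads.shape[0]
    flat = grads.reshape(B, -1)
    norms = np.linalg.norm(flat, axis=1)
    # Clip each example's gradient to L2 norm at most clip_norm.
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = flat * scale[:, None]
    # Gaussian noise calibrated to the clipping bound (the sensitivity).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=flat.shape[1])
    noisy_mean = (clipped.sum(axis=0) + noise) / B
    return W - lr * noisy_mean.reshape(W.shape)

# Toy usage: the encoders are treated as frozen, so only W is ever updated.
rng = np.random.default_rng(0)
B, d_in, d_out = 16, 8, 4
x = rng.normal(size=(B, d_in))    # stand-in for frozen encoder outputs
t = rng.normal(size=(B, d_out))   # stand-in for paired text-side targets
W = 0.1 * rng.normal(size=(d_out, d_in))

W_new = dp_sgd_step(W, per_example_grads(W, x, t), lr=0.1,
                    clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(W_new.shape)
```

Because the encoders never receive gradients, the privacy mechanism touches only the small matrix `W`, which is consistent with the abstract's report that image-side linear-probe utility is largely preserved. In practice a privacy accountant from a DP library would track the cumulative (ε, δ) over all training steps.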

Problem

Research questions and friction points this paper is trying to address.

cross-modal linkage

privacy risk

vision-language models

re-identification

clinical data

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal privacy

vision-language models

differential privacy