XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

📅 2024-07-28
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
Medical vision-language pretraining faces two key challenges: inaccurate reconstruction of pathological features—due to scarce annotated data—and suboptimal utilization of both paired and unpaired multimodal data. To address these, we propose a cross-modal interaction-driven joint masked modeling framework. Our contributions are threefold: (1) Attention-guided Masked Image Modeling (AttMIM), which enhances reconstruction fidelity of salient pathological regions via attention-based masking; (2) Entity-aware Masked Language Modeling (EntMLM) coupled with disease-category prompting, enabling the first unified optimization over both paired image-text and unpaired image-only or text-only data; and (3) an integrated training paradigm combining cross-modal attention, contrastive learning, and generative pretraining. Evaluated on five medical vision-language benchmarks, our method achieves state-of-the-art performance in both zero-shot and fine-tuned classification, significantly improving pathological representation learning and cross-task transfer robustness.

📝 Abstract
Vision-and-language pretraining (VLP) in the medical field utilizes contrastive learning on image-text pairs to achieve effective transfer across tasks. Yet, current VLP approaches with the masked modelling strategy face two challenges when applied to the medical domain. First, current models struggle to accurately reconstruct key pathological features due to the scarcity of medical data. Second, most methods adopt only paired image-text or image-only data, failing to exploit the combination of both paired and unpaired data. To this end, this paper proposes XLIP (Masked modelling for medical Language-Image Pre-training), a framework that enhances pathological learning and exploits unpaired data. First, we introduce the attention-masked image modelling (AttMIM) and entity-driven masked language modelling (EntMLM) modules, which learn to reconstruct pathological visual and textual tokens via multi-modal feature interaction, thereby strengthening pathology-aware representations. The AttMIM module masks the portion of image features that is most responsive to textual features, allowing XLIP to reconstruct highly similar medical images more efficiently. Second, XLIP capitalizes on unpaired data to enhance multimodal learning by introducing disease-kind prompts. Experimental results show that XLIP achieves state-of-the-art zero-shot and fine-tuned classification performance on five datasets. Our code will be available at https://github.com/White65534/XLIP
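The attention-guided masking strategy of AttMIM can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name `attention_guided_mask`, the feature shapes, and the choice of max-over-text-tokens as the per-patch responsiveness score are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_guided_mask(patch_feats, text_feats, mask_ratio=0.5):
    """Select image patches to mask by their cross-attention response to text.

    patch_feats: (N, d) image patch embeddings
    text_feats:  (M, d) text token embeddings
    Returns a boolean array of shape (N,); True = patch is masked.
    """
    d = patch_feats.shape[1]
    # Cross-modal attention: how strongly each patch attends to each text token.
    attn = softmax(patch_feats @ text_feats.T / np.sqrt(d), axis=-1)  # (N, M)
    # Per-patch responsiveness: peak attention over all text tokens (an
    # assumed aggregation; other reductions, e.g. mean, would also work).
    response = attn.max(axis=1)  # (N,)
    k = int(round(mask_ratio * len(response)))
    top = np.argsort(-response)[:k]  # indices of the most text-responsive patches
    mask = np.zeros(len(response), dtype=bool)
    mask[top] = True
    return mask

# Toy example: 8 patches, 4 text tokens, 16-dim features.
rng = np.random.default_rng(0)
mask = attention_guided_mask(rng.normal(size=(8, 16)),
                             rng.normal(size=(4, 16)),
                             mask_ratio=0.25)
print(mask.sum())  # 2 of 8 patches selected for masking
```

Masking the patches most responsive to the report text biases reconstruction toward clinically salient regions rather than uniformly random background, which is the intuition the abstract describes.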
Problem

Research questions and friction points this paper is trying to address.

Reconstructing key pathological features with limited medical data
Combining paired and unpaired data for multimodal learning
Improving medical image-text feature interaction via attention masking
Innovation

Methods, ideas, or system contributions that make the work stand out.

AttMIM and EntMLM enhance pathological feature learning
Uses disease-kind prompts for unpaired data integration
Improves medical image reconstruction via cross-modal attention
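The entity-driven masking idea behind EntMLM can be illustrated with a toy sketch. The `entity_mask` helper and the small disease vocabulary below are hypothetical stand-ins (the paper presumably relies on medical entity extraction), shown only to convey that disease-naming tokens are the ones selected for masking rather than random words.

```python
def entity_mask(tokens, entity_vocab, mask_token="[MASK]"):
    """Mask tokens that name medical entities, leaving other words intact."""
    return [mask_token if t.lower() in entity_vocab else t for t in tokens]

# Toy radiology sentence and a tiny assumed disease vocabulary.
report = "Findings consistent with mild pneumonia and cardiomegaly".split()
masked = entity_mask(report, {"pneumonia", "cardiomegaly"})
print(masked)
# ['Findings', 'consistent', 'with', 'mild', '[MASK]', 'and', '[MASK]']
```

Forcing the language model to recover disease entities (rather than common filler words) concentrates the MLM signal on pathology, mirroring how AttMIM concentrates the image-side signal on text-responsive regions.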
Biao Wu
Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo NSW 2007, Australia
Yutong Xie
Australian Institute for Machine Learning, The University of Adelaide, Adelaide SA 5005, Australia
Zeyu Zhang
Australian National University, Canberra ACT 2601, Australia
Minh Hieu Phan
Australian Institute for Machine Learning, The University of Adelaide, Adelaide SA 5005, Australia
Qi Chen
Australian Institute for Machine Learning, The University of Adelaide, Adelaide SA 5005, Australia
Ling Chen
Australian Artificial Intelligence Institute, University of Technology Sydney, Ultimo NSW 2007, Australia
Qi Wu
Australian Institute for Machine Learning, The University of Adelaide, Adelaide SA 5005, Australia