🤖 AI Summary
To address the challenges of label scarcity, small lesion regions, and severe class imbalance in mammography, which hinder effective adaptation of CLIP models, this paper proposes MaMA, one of the first end-to-end CLIP pre-training frameworks tailored to mammographic imaging. Methodologically, MaMA introduces a multi-view supervised contrastive learning strategy coupled with a symmetric local alignment module; integrates parameter-efficient fine-tuning of a large language model pre-trained with medical knowledge; and adopts high-resolution local-attention image encoding. Evaluated on EMBED and RSNA-Mammo across classification, cross-modal retrieval, and zero-shot diagnosis tasks, MaMA consistently outperforms state-of-the-art baselines. Notably, its model size is only 52% of the largest baseline's, combining computational efficiency with clinical practicality.
📝 Abstract
Contrastive Language-Image Pre-training (CLIP) demonstrates strong potential in medical image analysis but requires substantial data and computational resources. Due to these constraints, existing CLIP applications in medical imaging focus mainly on modalities such as chest X-rays, for which abundant image-report data are available, leaving many other important modalities underexplored. Here, we propose one of the first adaptations of the full CLIP model to mammography, which presents significant challenges due to labeled data scarcity, high-resolution images with small regions of interest, and class imbalance. We first develop a specialized supervision framework for mammography that leverages its multi-view nature. Furthermore, we design a symmetric local alignment module to better focus on detailed features in high-resolution images. Lastly, we incorporate a parameter-efficient fine-tuning approach for large language models pre-trained with medical knowledge to address data limitations. Our multi-view and multi-scale alignment (MaMA) method outperforms state-of-the-art baselines on three different tasks across two large real-world mammography datasets, EMBED and RSNA-Mammo, with only 52% of the largest baseline's model size. The code is available at https://github.com/XYPB/MaMA.
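To make the multi-view supervision idea concrete, the following is a minimal sketch (not MaMA's actual implementation) of a supervised contrastive loss in which all views of the same study, e.g. CC and MLO of one patient, are treated as positives for each other, while views from other patients serve as negatives. The function name, the NumPy formulation, and the patient-ID grouping are illustrative assumptions; the paper's loss additionally involves text embeddings and local alignment.

```python
import numpy as np

def multiview_supcon_loss(embeddings, patient_ids, temperature=0.1):
    """Illustrative multi-view supervised contrastive loss (sketch only).

    embeddings  : (N, D) array of view embeddings.
    patient_ids : length-N sequence; views sharing an ID are positives.
    """
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature  # pairwise similarity logits

    n = len(patient_ids)
    losses = []
    for i in range(n):
        # Positives: other views of the same patient
        pos = [j for j in range(n) if j != i and patient_ids[j] == patient_ids[i]]
        if not pos:
            continue  # no positive pair for this anchor
        # Denominator: log-sum-exp over all samples except the anchor itself
        logits = np.delete(sim[i], i)
        log_denom = np.log(np.exp(logits).sum())
        # Average the InfoNCE term over all positives of this anchor
        losses.append(-np.mean([sim[i, j] - log_denom for j in pos]))
    return float(np.mean(losses))
```

Intuitively, the loss is low when views of the same patient are embedded close together and far from other patients' views, which is the behavior the multi-view supervision framework encourages.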