MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction

📅 2025-10-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the reliance of breast cancer computer-aided diagnosis (CAD) models on costly, expert-annotated radiology reports, this paper proposes a cross-modal self-supervised learning framework that requires no real radiology reports. Methodologically, it pairs synthetic radiology reports with multi-view mammograms to construct vision–language pairs, combining the CLIP architecture, multi-view image encoding, synthetic text generation, and pseudo-label alignment for joint vision–text self-supervision and multi-view supervision. The key contribution is enabling high-fidelity cross-modal representation learning from high-quality synthetic reports alone, eliminating the need for ground-truth textual annotations and thereby substantially improving data efficiency and generalization. Experiments demonstrate state-of-the-art performance across three clinical tasks (malignant tumor classification, molecular subtype identification, and breast cancer risk prediction), consistently outperforming fully supervised methods and existing vision–language model (VLM) baselines.
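As a minimal sketch of what such a CLIP-style objective over multi-view mammograms could look like, consider the PyTorch snippet below. The encoder backbones, the mean fusion across views, and all names (`MultiViewCLIP`, `encode_views`, the temperature initialization) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewCLIP(nn.Module):
    """CLIP-style contrastive alignment over multi-view mammograms.

    Sketch only: `image_encoder` and `text_encoder` are hypothetical
    stand-ins for the paper's unspecified backbones (e.g. a CNN/ViT and a
    BERT-style text model), each assumed to emit a fixed-size embedding.
    """

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # per-view feature extractor -> (N, D)
        self.text_encoder = text_encoder    # synthetic-report encoder  -> (N, D)
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def encode_views(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, V, C, H, W), e.g. V=4 for left/right CC and MLO views.
        b, v = views.shape[:2]
        feats = self.image_encoder(views.flatten(0, 1))  # (B*V, D)
        feats = feats.view(b, v, -1).mean(dim=1)         # naive mean fusion across views
        return F.normalize(feats, dim=-1)

    def forward(self, views: torch.Tensor, report_tokens: torch.Tensor) -> torch.Tensor:
        img = self.encode_views(views)                               # (B, D)
        txt = F.normalize(self.text_encoder(report_tokens), dim=-1)  # (B, D)
        logits = self.logit_scale.exp() * img @ txt.t()              # (B, B) similarities
        targets = torch.arange(img.size(0), device=img.device)
        # Symmetric InfoNCE: each exam matches its own synthetic report.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```

Mean pooling is the simplest possible view-fusion choice; the paper's multi-view encoding may well use attention or view-specific heads instead.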

📝 Abstract
Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-grained annotations is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model (MV-MLM) for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs, covering multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy across different data types and tasks, distinguishing breast tissues and cancer characteristics (calcification, mass) and utilizing these patterns to interpret mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised and VLM baselines while being trained on synthetic text reports, without the need for actual radiology reports.
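Since the model trains against synthetic rather than real reports, the pairing step can reduce to rendering text from structured labels. The following is a hypothetical templating function, a sketch under the assumption that reports are generated from discrete findings such as mass/calcification plus a BI-RADS category; the paper's actual report-generation procedure is not reproduced here.

```python
# Hypothetical pseudo-report templating from structured mammography labels.
def make_pseudo_report(laterality: str, finding: str, birads: int) -> str:
    """Render a minimal pseudo-report from discrete labels.

    finding: e.g. "mass", "calcification", or "no finding".
    """
    if finding == "no finding":
        body = f"No suspicious mass or calcification in the {laterality} breast."
    else:
        body = f"A {finding} is noted in the {laterality} breast."
    return f"{body} BI-RADS category {birads}."

# Each exam's labels become a caption-like text to pair with its views.
print(make_pseudo_report("left", "calcification", 4))
# -> "A calcification is noted in the left breast. BI-RADS category 4."
```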
Problem

Research questions and friction points this paper is trying to address.

Developing a multi-view mammography-language model for breast cancer diagnosis
Enhancing data efficiency with synthetic radiology reports and cross-modal learning
Improving malignancy classification, subtype identification, and cancer risk prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view mammography with synthetic radiology reports
Cross-modal self-supervision across image-text pairs
Joint visual-textual learning for cancer classification (a sketch follows this list)
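As a rough illustration of how joint visual-textual learning might combine contrastive alignment with the summary's "pseudo-label alignment", here is a hedged sketch of a joint objective; the classification head, the loss weight `alpha`, and the fixed temperature are assumptions, not the authors' reported formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_objective(img_emb: torch.Tensor,        # (B, D) L2-normalized image embeddings
                    txt_emb: torch.Tensor,        # (B, D) L2-normalized text embeddings
                    pseudo_labels: torch.Tensor,  # (B,) e.g. malignancy pseudo-labels
                    head: nn.Module,              # hypothetical classification head
                    logit_scale: float = 100.0,
                    alpha: float = 0.5) -> torch.Tensor:
    """Symmetric contrastive alignment plus a pseudo-label classification term."""
    logits = logit_scale * img_emb @ txt_emb.t()  # (B, B) image-text similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    cls = F.cross_entropy(head(img_emb), pseudo_labels)  # pseudo-label alignment
    return contrastive + alpha * cls
```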
👥 Authors
Shunjie-Fabian Zheng (Department of Medicine I, LMU University Hospital, LMU Munich, Germany)
Hyeonjun Lee (Lunit Inc.)
Thijs Kooi (Lunit Inc.)
Ali Diba (Lunit Inc.)

🏷️ Keywords: Machine Learning · Medical Image Analysis · Computer-aided diagnosis