BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing medical multimodal large language models struggle to support end-to-end clinical reasoning across breast cancer screening, diagnosis, and treatment planning, because training data are scarce and the models are typically built for isolated modalities or narrow task families. To address this, the work proposes BreastGPT, a unified multimodal large language model, along with BreastStage, an instruction-tuning dataset aligned with the full clinical workflow of breast cancer. BreastStage comprises 1.86M instruction-following pairs drawn from 17 sub-datasets, covering five imaging modalities and 136 task templates. The model combines a dual-branch visual encoder with concept-preserving image token compression to bridge the scale gap between standard radiology images and gigapixel pathology slides. On BreastStage-Bench, the held-out split of BreastStage, BreastGPT achieves 75.66% closed-ended accuracy and an open-ended score of 89.92%, outperforming both general-purpose and medical-specific multimodal models.

📝 Abstract

Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis, and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, because of data scarcity and limited model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, which limits their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and an open-ended score of 89.92%, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at https://yangyy-liu.github.io/BreastGPT.io.

Problem

Research questions and friction points this paper is trying to address.

multimodal large language model

breast cancer clinical workflow

data scarcity

workflow-level reasoning

medical MLLM

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal large language model

workflow-aligned dataset

dual-branch visual encoder