Multimodal Prototype Alignment for Semi-supervised Pathology Image Segmentation

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pathological image segmentation faces dual challenges: ambiguous semantic boundaries and high cost of pixel-level annotations. To address these, we propose MPAMatch—a novel framework that introduces text prototype supervision for the first time, establishing an image-text dual-prototype contrastive learning mechanism to jointly guide coarse structural and fine-grained semantic modeling. MPAMatch integrates pixel-wise contrastive learning, multimodal prototype alignment, and consistency regularization, while replacing the ViT backbone in TransUNet with the pathology-pretrained Uni model to enhance feature representation. Extensive experiments on benchmark datasets—including GLAS and EBHI-SEG-GLAND—demonstrate significant improvements over state-of-the-art methods, validating the synergistic gains in structural modeling and semantic understanding. The framework achieves superior segmentation accuracy, robustness to boundary ambiguity, and efficient utilization of limited annotated data.

📝 Abstract
Pathological image segmentation faces numerous challenges, particularly ambiguous semantic boundaries and the high cost of pixel-level annotations. Although recent semi-supervised methods based on consistency regularization (e.g., UniMatch) have made notable progress, they rely mainly on perturbation-based consistency within the image modality, making it difficult to capture high-level semantic priors, especially in structurally complex pathology images. To address these limitations, we propose MPAMatch, a novel segmentation framework that performs pixel-level contrastive learning under a multimodal prototype-guided supervision paradigm. The core innovation of MPAMatch lies in its dual contrastive learning scheme, between image prototypes and pixel labels and between text prototypes and pixel labels, providing supervision at both the structural and the semantic level. This coarse-to-fine supervisory strategy not only enhances discriminative capability on unlabeled samples but also introduces text prototype supervision into segmentation for the first time, significantly improving semantic boundary modeling. In addition, we reconstruct the classic segmentation architecture (TransUNet) by replacing its ViT backbone with a pathology-pretrained foundation model (Uni), enabling more effective extraction of pathology-relevant features. Extensive experiments on GLAS, EBHI-SEG-GLAND, EBHI-SEG-CANCER, and KPI show MPAMatch's superiority over state-of-the-art methods, validating its dual advantages in structural and semantic modeling.
Problem

Research questions and friction points this paper is trying to address.

Addresses ambiguous semantic boundaries in pathology image segmentation
Reduces reliance on costly pixel-level annotations through semi-supervised learning
Improves semantic boundary modeling with multimodal prototype alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal prototype-guided supervision for segmentation
Dual contrastive learning between image and text prototypes
Pathology-pretrained foundation model replacing ViT backbone
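The dual prototype supervision described above can be sketched as a pair of prototype-pixel contrastive terms: pixel embeddings are pulled toward the prototype of their (pseudo-)labeled class and pushed away from the others, once with image-derived prototypes and once with text-derived prototypes. The sketch below is a minimal NumPy illustration of that common formulation, not the paper's exact implementation; the temperature, the weighting `w`, and the function names are illustrative assumptions.

```python
import numpy as np

def prototype_contrastive_loss(pixel_emb, prototypes, labels, tau=0.07):
    """Cross-entropy over cosine similarities between pixel embeddings
    (N, D) and class prototypes (C, D). tau is an assumed temperature."""
    # L2-normalize so dot products become cosine similarities
    pe = pixel_emb / np.linalg.norm(pixel_emb, axis=1, keepdims=True)
    pr = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = pe @ pr.T / tau                       # (N, C) similarity logits
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pull each pixel toward its own class prototype (NLL of correct class)
    return -log_prob[np.arange(len(labels)), labels].mean()

def dual_prototype_loss(pixel_emb, img_protos, txt_protos, labels, w=0.5):
    """Average of the image-prototype (structural) and text-prototype
    (semantic) contrastive terms; w is an illustrative mixing weight."""
    return (w * prototype_contrastive_loss(pixel_emb, img_protos, labels)
            + (1 - w) * prototype_contrastive_loss(pixel_emb, txt_protos, labels))
```

In practice the image prototypes would be class-averaged features from the Uni-based encoder and the text prototypes would come from a text encoder's class embeddings, with pseudo-labels supplying `labels` on unlabeled pixels.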
Tags: AI4Pathology, Foundation-Model, Vision-Language-Model

Mingxi Fu
Shenzhen International Graduate School, Tsinghua University
Fanglei Fu
Shenzhen International Graduate School, Tsinghua University
Xitong Ling
Tsinghua University
Huaitian Yuan
Shenzhen International Graduate School, Tsinghua University
Tian Guan
Shenzhen International Graduate School, Tsinghua University
Yonghong He
Shenzhen International Graduate School, Tsinghua University
Biomedical engineering, optical imaging, AI-based image processing, pathology foundation models
Lianghui Zhu
Shenzhen International Graduate School, Tsinghua University