Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology

📅 2025-03-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing slide representation learning in computational pathology relies predominantly on task-specific multiple-instance learning or on unimodal unsupervised approaches, which limits generalizability and neglects textual semantics. To address this, the authors propose ProAlign, a cross-modal unsupervised representation learning framework for whole-slide images (WSIs). The method leverages a large language model to automatically generate pathology-aware textual descriptions, enabling a fully self-supervised patch-text contrastive learning paradigm without manual annotations. A cross-modal prototype alignment mechanism and a parameter-free attention-based aggregation strategy then provide fine-grained semantic alignment and efficient slide-level representation learning. Evaluated on four public benchmarks, ProAlign significantly outperforms existing unsupervised methods and matches the performance of several weakly supervised baselines, improving generalization across downstream tasks such as classification and survival prediction and demonstrating robust cross-modal semantic understanding and scalable representation learning for digital pathology.

📝 Abstract
With the rapid advancement of pathology foundation models (FMs), representation learning for whole slide images (WSIs) has attracted increasing attention. Existing studies develop high-quality patch feature extractors and employ carefully designed aggregation schemes to derive slide-level representations. However, mainstream weakly supervised slide representation learning methods, primarily based on multiple instance learning (MIL), are tailored to specific downstream tasks, which limits their generalizability. To address this issue, some studies explore unsupervised slide representation learning. However, these approaches focus solely on the visual modality of patches, neglecting the rich semantic information embedded in textual data. In this work, we propose ProAlign, a cross-modal unsupervised slide representation learning framework. Specifically, we leverage a large language model (LLM) to generate descriptive text for the prototype types present in a WSI, introducing patch-text contrast to construct initial prototype embeddings. Furthermore, we propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks. Extensive experiments on four public datasets show that ProAlign outperforms existing unsupervised frameworks and achieves performance comparable to some weakly supervised models.
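A minimal sketch of the patch-text contrast that builds the prototype embeddings, assuming patch features from a frozen patch encoder, prototype descriptions generated by the LLM and embedded by a frozen text encoder, and a per-patch prototype assignment (e.g., from clustering); function and variable names are illustrative, not the authors' released code.

```python
# Sketch of patch-text contrastive prototype construction (assumed setup, not
# the paper's exact code): each patch is pulled toward the text embedding of
# its assigned prototype type and pushed away from the other prototypes.
import torch
import torch.nn.functional as F

def patch_text_contrastive_loss(patch_feats: torch.Tensor,  # (N, d) patch embeddings
                                text_feats: torch.Tensor,   # (K, d) prototype text embeddings
                                assignments: torch.Tensor,  # (N,) prototype index per patch
                                tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style patch-text contrast over K prototype descriptions."""
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = (p @ t.T) / tau  # (N, K) temperature-scaled cosine similarities
    return F.cross_entropy(logits, assignments)
```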
Problem

Research questions and friction points this paper is trying to address.

Unsupervised slide representation learning for pathology images
Leveraging patch-text contrast for cross-modal prototype alignment
Generalizable slide embeddings for diverse downstream tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLM for WSI descriptive text generation
Uses patch-text contrast for prototype embeddings
Parameter-free attention aggregates patch-prototype similarity into slide embeddings (see the sketch below)
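A minimal sketch of what the parameter-free aggregation could look like, assuming the prototype embeddings come from the contrastive step above: softmax-normalized patch-prototype similarities act as attention weights, so nothing is trained. The concatenated prototype-wise pooling is an assumption, not necessarily the paper's exact layout.

```python
# Sketch of parameter-free attention aggregation (assumed layout): each
# prototype attends over the slide's patches via cosine similarity, and the
# pooled per-prototype vectors are concatenated into one slide embedding.
import torch
import torch.nn.functional as F

def aggregate_slide(patch_feats: torch.Tensor,  # (N, d) patch embeddings of one WSI
                    proto_feats: torch.Tensor,  # (K, d) prototype embeddings
                    tau: float = 1.0) -> torch.Tensor:
    """Slide embedding from patch-prototype similarity alone; no learned weights."""
    p = F.normalize(patch_feats, dim=-1)
    c = F.normalize(proto_feats, dim=-1)
    attn = F.softmax((p @ c.T) / tau, dim=0)  # (N, K) attention of each prototype over patches
    slide = attn.T @ patch_feats              # (K, d) attention-weighted patch pooling
    return slide.flatten()                    # (K*d,) task-agnostic slide embedding
```

Because no parameters are fitted, the same slide embedding can be reused across downstream tasks with lightweight heads such as linear probes.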
🔎 Similar Papers
No similar papers found.
Yuxuan Chen
Shenzhen International Graduate School, Tsinghua University, China
Jiawen Li
Shenzhen International Graduate School, Tsinghua University, China
Jiali Hu
City University of Hong Kong (Dongguan)
Optical Imaging · Deep Learning · Computer Vision
Xitong Ling
Tsinghua University
AI4Pathology · Foundation-Model · Vision-Language-Model
Tian Guan
Shenzhen International Graduate School, Tsinghua University, China
Anjia Han
Department of Pathology, The First Affiliated Hospital of Sun Yat-sen University, China
Yonghong He
Shenzhen International Graduate School, Tsinghua University, China
Biomedical Engineering · Optical Imaging · AI Image Processing · Pathology Foundation Models