🤖 AI Summary
Existing pathology foundation models predominantly rely on task-specific multiple instance learning or unimodal unsupervised approaches, which limits their generalizability and neglects textual semantics. To address this, we propose the first cross-modal unsupervised representation learning framework for whole-slide images (WSIs). Our method leverages large language models to automatically generate pathology-aware textual descriptions, enabling a fully self-supervised patch–text contrastive learning paradigm without manual annotations. We further introduce a cross-modal prototype alignment mechanism and a parameter-free attention-based aggregation strategy to achieve fine-grained semantic alignment and efficient slide-level representation learning. Evaluated on four public benchmarks, our approach significantly outperforms existing unsupervised methods and matches the performance of several weakly supervised baselines. It substantially improves generalization across downstream tasks (classification, segmentation, and survival prediction), demonstrating robust cross-modal semantic understanding and scalable representation learning for digital pathology.
📝 Abstract
With the rapid advancement of pathology foundation models (FMs), representation learning for whole slide images (WSIs) has attracted increasing attention. Existing studies develop high-quality patch feature extractors and employ carefully designed aggregation schemes to derive slide-level representations. However, mainstream weakly supervised slide representation learning methods, primarily based on multiple instance learning (MIL), are tailored to specific downstream tasks, which limits their generalizability. To address this issue, some studies explore unsupervised slide representation learning, but these approaches focus solely on the visual modality of patches, neglecting the rich semantic information embedded in textual data. In this work, we propose ProAlign, a cross-modal unsupervised slide representation learning framework. Specifically, we leverage a large language model (LLM) to generate descriptive text for the prototype types present in a WSI and introduce patch–text contrast to construct initial prototype embeddings. Furthermore, we propose a parameter-free attention aggregation strategy that uses the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks. Extensive experiments on four public datasets show that ProAlign outperforms existing unsupervised frameworks and achieves performance comparable to some weakly supervised models.
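The parameter-free aggregation described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the function name, the softmax-over-patches weighting, the temperature value, and the concatenation of prototype-conditioned pooled features are all assumptions made for clarity.

```python
import numpy as np

def aggregate_slide_embedding(patch_feats, prototype_embs, temperature=0.07):
    """Hypothetical sketch of parameter-free attention aggregation.

    patch_feats:    (N, d) patch features from a frozen patch encoder.
    prototype_embs: (K, d) text-derived prototype embeddings.
    Returns a flattened (K * d,) slide-level embedding; no learned weights.
    """
    # L2-normalize so dot products become cosine similarities.
    P = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    C = prototype_embs / np.linalg.norm(prototype_embs, axis=1, keepdims=True)

    # Patch-prototype similarity matrix, shape (N, K).
    sim = P @ C.T

    # Softmax over patches for each prototype: attention weights that
    # emphasize patches most similar to that prototype (parameter-free).
    w = np.exp(sim / temperature)
    w = w / w.sum(axis=0, keepdims=True)          # columns sum to 1

    # Prototype-conditioned pooled patch features, shape (K, d),
    # concatenated into one slide embedding.
    slide = w.T @ P
    return slide.reshape(-1)
```

Because the weights come only from similarities between fixed patch features and fixed prototype embeddings, the slide embedding requires no training and can feed any downstream task head.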