🤖 AI Summary
This study addresses the lack of systematic, model-agnostic evaluation of foundation models for pixel-level semantic segmentation in histopathology. The authors propose a fine-tuning-free evaluation framework that extracts attention maps from vision or vision-language foundation models—including CLIP, DINO, and CONCH—as pixel-level features and combines them with an XGBoost classifier to uniformly assess the segmentation performance of ten foundation models across four tissue- and cell-level pathology datasets. Experimental results demonstrate that CONCH achieves the best performance, followed by PathDino. Moreover, fusing features from CONCH, PathDino, and CellViT substantially enhances generalization, yielding an average segmentation performance gain of 7.95% across all datasets, thereby validating the effectiveness of complementary cross-model representations.
📝 Abstract
In recent years, foundation models such as CLIP, DINO, and CONCH have demonstrated remarkable domain generalization and unsupervised feature extraction capabilities across diverse imaging tasks. However, systematic and independent evaluations of these models for pixel-level semantic segmentation in histopathology remain scarce. In this study, we propose a robust benchmarking approach to assess 10 foundation models on four histopathological datasets covering both morphological tissue-region and cellular/nuclear segmentation tasks. Our method leverages the attention maps of foundation models as pixel-wise features, which are then classified with XGBoost, a machine learning algorithm, enabling fast, interpretable, and model-agnostic evaluation without fine-tuning. We show that the vision-language foundation model CONCH performed best across datasets compared to vision-only foundation models, with PathDino a close second. Further analysis shows that models trained on distinct histopathology cohorts capture complementary morphological representations, and concatenating their features yields superior segmentation performance. Concatenating features from CONCH, PathDino, and CellViT outperformed individual models across all datasets by 7.95% (averaged across the datasets), suggesting that ensembles of foundation models generalize better to diverse histopathological segmentation tasks.
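The evaluation recipe described above — treating per-pixel features derived from a foundation model's attention maps as inputs to a boosted-tree classifier, and optionally concatenating features from several models — can be sketched as follows. This is a minimal illustration with synthetic stand-in features, not the authors' code: the feature arrays, shapes, and labels are invented, and `GradientBoostingClassifier` is used as a stand-in where the XGBoost library is unavailable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Stand-in for per-pixel features extracted from a foundation model's
# attention maps (e.g. upsampled ViT attention heads). The shapes and the
# binary "tissue vs. background" labels below are synthetic.
H, W, D = 16, 16, 8                        # image height/width, feature dim
feats_model_a = rng.normal(size=(H * W, D))  # e.g. one model's features
feats_model_b = rng.normal(size=(H * W, D))  # e.g. another model's features
labels = (feats_model_a[:, 0] + feats_model_b[:, 0] > 0).astype(int)

# Single-model evaluation: fit a boosted-tree classifier on one model's
# pixel-wise features. The paper uses XGBoost; GradientBoostingClassifier
# plays that role here.
clf_single = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf_single.fit(feats_model_a, labels)
acc_single = clf_single.score(feats_model_a, labels)

# Multi-model fusion: concatenate per-pixel features from several models
# before classification, mirroring the CONCH + PathDino + CellViT ensemble.
fused = np.concatenate([feats_model_a, feats_model_b], axis=1)
clf_fused = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf_fused.fit(fused, labels)
acc_fused = clf_fused.score(fused, labels)

print(f"single-model acc: {acc_single:.3f}, fused acc: {acc_fused:.3f}")
```

In a real run, a proper train/test split and a segmentation metric such as Dice or mIoU would replace the training accuracy used here for brevity; the key point is that fusion only changes the feature matrix, not the classifier.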