AI Summary
Problem: Foundation models exhibit weak cross-center generalization, and robust metrics for evaluating their stability under distribution shift in skin cancer subtype classification are lacking. Method: We introduce AI4SkIN, a multi-center whole-slide image benchmark, to systematically evaluate computational pathology foundation models as patch-level feature extractors within a multiple-instance learning (MIL) framework. We propose the FM-Silhouette Index (FM-SI), a novel metric quantifying feature consistency under distribution shift, revealing the critical role of low-bias features for similarity-driven MIL classifiers. Results: Experiments demonstrate a strong positive correlation between feature quality and final classification accuracy. Our evaluation framework significantly improves complex subtype recognition, particularly enhancing the robustness of similarity-based MIL models, thereby establishing a reproducible, interpretable paradigm for real-world computational pathology foundation model assessment.
Abstract
Pretraining on large-scale, in-domain datasets grants histopathology foundation models (FMs) the ability to learn task-agnostic data representations, enhancing transfer learning on downstream tasks. In computational pathology, automated whole slide image analysis requires multiple instance learning (MIL) frameworks due to the gigapixel scale of the slides. The diversity among histopathology FMs has highlighted the need to design real-world challenges for evaluating their effectiveness. To bridge this gap, our work presents a novel benchmark for evaluating histopathology FMs as patch-level feature extractors within a MIL classification framework. For that purpose, we leverage the AI4SkIN dataset, a multi-center cohort encompassing slides with challenging cutaneous spindle cell neoplasm subtypes. We also define the Foundation Model - Silhouette Index (FM-SI), a novel metric to measure model consistency against distribution shifts. Our experiments show that extracting less biased features enhances classification performance, especially in similarity-based MIL classifiers.
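The abstract does not spell out how FM-SI is computed, so the following is only an illustrative sketch of the standard silhouette coefficient that the metric's name points to, not the authors' definition. It assumes slide-level embeddings (for example, averaged patch features) and uses the acquisition center as the grouping label, so a high value would indicate that features cluster by center (more bias) and a value near or below zero would indicate center-agnostic features. The function name `silhouette_index` and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_index(embeddings, labels):
    """Mean silhouette coefficient of `embeddings` grouped by `labels`.

    Assumption: `labels` holds the acquisition center of each slide,
    so the score measures how strongly features separate by center.
    """
    groups = sorted(set(labels))
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    scores = []
    for i, (x, own) in enumerate(zip(embeddings, labels)):
        # Mean distance from x to the other members of every group.
        mean_dist = {}
        for g in groups:
            dists = [
                euclidean(x, y)
                for j, (y, lab) in enumerate(zip(embeddings, labels))
                if lab == g and j != i
            ]
            mean_dist[g] = sum(dists) / len(dists) if dists else None
        a = mean_dist[own]
        if a is None:
            # Singleton group: the silhouette is conventionally 0.
            scores.append(0.0)
            continue
        # Nearest other group, by mean distance.
        b = min(d for g, d in mean_dist.items() if g != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D "slide embeddings" from two hypothetical centers, A and B.
centers = ["A", "A", "B", "B"]
# Features that separate cleanly by center (strong center bias).
biased = silhouette_index([(0, 0), (0, 1), (10, 0), (10, 1)], centers)
# Features where the centers are interleaved (little center bias).
mixed = silhouette_index([(0, 0), (10, 1), (0, 1), (10, 0)], centers)
```

In this toy setup `biased` comes out well above `mixed`, which is the kind of ordering a center-bias measure would be expected to produce. Whether FM-SI uses slide-level or patch-level features, which distance it uses, and how it is normalized are not stated in the abstract.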