From Patches to Patients: A study of the tile-to-slide performance transferability in Digital Pathology

📅 2026-06-09

🤖 AI Summary

This study addresses the high computational cost of selecting a foundation model for whole-slide image (WSI) tasks in digital pathology by testing whether tile-level linear probing can serve as an efficient proxy for slide-level performance. The authors benchmark 19 state-of-the-art foundation models on 42 slide-level and 16 tile-level tasks, using ABMIL and mean pooling as slide-level aggregators. They report a high correlation between tile-level and slide-level performance, indicating that tile-level benchmarking can shortlist models likely to perform well at the slide level and so reduce model selection costs. Sensitivity analyses indicate that this transferability depends more on cohort size and the number of tiles per slide than on average task difficulty. Slide-level evaluation remains necessary for final validation on clinical tasks.
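As a rough illustration of the mean pooling aggregation mentioned above, the sketch below averages a slide's tile embeddings into a single slide-level vector. The function name `mean_pool` and the toy two-dimensional embeddings are hypothetical; real encoders produce much larger vectors, and the study's actual pipeline is not shown in this summary.

```python
def mean_pool(tile_embeddings):
    """Average a slide's tile embeddings into one slide-level vector.

    tile_embeddings: list of equal-length lists of floats, one per tile.
    """
    if not tile_embeddings:
        raise ValueError("slide has no tiles")
    n_tiles = len(tile_embeddings)
    dim = len(tile_embeddings[0])
    # Average each embedding dimension across all tiles of the slide.
    return [sum(tile[d] for tile in tile_embeddings) / n_tiles for d in range(dim)]


# Toy slide with three tiles and 2-dimensional embeddings (hypothetical values).
slide_tiles = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
slide_vector = mean_pool(slide_tiles)
print(slide_vector)
```

Unlike ABMIL, mean pooling has no trainable parameters, so a simple classifier can be fitted directly on the pooled slide vectors.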

📝 Abstract

Foundation Models (FMs) have recently redefined the state-of-the-art in histopathology by providing robust representations for whole-slide image (WSI) analysis. However, selecting the optimal FM for a specific clinical cohort currently requires multiple preprocessing steps, followed by computationally expensive feature extraction and the training of a Multiple Instance Learning (MIL) aggregator for every model. In this work, we investigate whether efficient tile-level linear probing can serve as a reliable proxy for slide-level performance, reducing the need to run full slide-level pipelines for every candidate encoder. We benchmark 19 state-of-the-art FMs on 42 slide-level and 16 tile-level tasks, comparing tile probing metrics against slide-level outcomes obtained with ABMIL and mean pooling aggregation. We observe a high correlation between tile-level and slide-level performance across varying task difficulties, indicating that encoder representation quality is the primary determinant of WSI success. Sensitivity analyses show that transferability is stable across models and is influenced more by cohort size and the number of tiles per slide than by average task difficulty. We also measure the agreement in best-performing models between tile-level and slide-level tasks, showing that tile benchmarks reliably shortlist strong candidates. Overall, our study indicates that tile-level benchmarking provides an efficient and practical first step for narrowing down candidate models, while slide-level evaluation remains essential for final validation on clinical tasks.
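To make the tile-to-slide comparison concrete, here is a minimal sketch of how per-model tile scores could be compared against slide scores. It assumes Spearman rank correlation and a top-k overlap as the agreement measures; the abstract does not specify which statistics the authors use, and the encoder names and scores below are made-up placeholders, not results from the paper.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def top_k_overlap(tile_scores, slide_scores, k):
    """Fraction of the top-k tile-level models that are also top-k at slide level."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    return len(top(tile_scores) & top(slide_scores)) / k


# Hypothetical per-encoder scores (placeholders, NOT taken from the paper).
tile_scores = {"enc_a": 0.91, "enc_b": 0.88, "enc_c": 0.84, "enc_d": 0.80, "enc_e": 0.75}
slide_scores = {"enc_a": 0.86, "enc_b": 0.87, "enc_c": 0.81, "enc_d": 0.78, "enc_e": 0.70}

models = sorted(tile_scores)
rho = spearman([tile_scores[m] for m in models], [slide_scores[m] for m in models])
overlap_top2 = top_k_overlap(tile_scores, slide_scores, k=2)
print(f"Spearman rho: {rho:.2f}, top-2 overlap: {overlap_top2:.2f}")
```

In this toy example the two rankings differ only in the order of the top two encoders, so the tile-level shortlist of two still contains the best slide-level model, which is the kind of pre-selection the abstract describes.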

Problem

Research questions and friction points this paper is trying to address.

Digital Pathology

Foundation Models

Whole-slide Image

Model Selection

Tile-level Benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tile-level Linear Probing

Foundation Models

Whole-slide Image Analysis