THUNDER: Tile-level Histopathology image UNDERstanding benchmark

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite the proliferation of tile-level foundation models in digital pathology, no unified benchmark exists for systematically evaluating their feature representations, robustness, and uncertainty quantification, which hinders clinically trustworthy deployment. Method: We introduce the first tile-level foundation model benchmark platform specifically designed for histopathological images, comprising four integrated modules: embedding analysis, downstream task evaluation (across 16 diverse datasets spanning classification, segmentation, and other tasks), robustness testing (e.g., stain variation, resolution degradation), and uncertainty quantification. Contribution/Results: This work enables the first side-by-side comparative evaluation of 23 state-of-the-art foundation models under standardized protocols; supports plug-and-play integration of user-defined models and dynamic benchmark expansion; and releases all code and evaluation pipelines open-source to foster reproducible, reliable, and clinically translatable AI research in computational pathology.
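To make the "downstream task evaluation" module concrete, here is a minimal, self-contained sketch of how frozen tile embeddings from competing encoders can be compared under a shared probing protocol. This is an illustrative toy, not THUNDER's actual API: the encoder names, the synthetic embeddings, and the nearest-centroid probe are all hypothetical stand-ins for real model features and the benchmark's linear probes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_embeddings(n_per_class, dim, sep):
    """Simulate frozen tile embeddings for a binary task; `sep` controls
    how well the (hypothetical) encoder separates the two classes."""
    x0 = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    x1 = rng.normal(sep, 1.0, size=(n_per_class, dim))
    X = np.vstack([x0, x1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Probe the embedding space with a nearest-centroid classifier,
    a minimal stand-in for the linear probes used in tile-level benchmarks."""
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    preds = dists.argmin(axis=1)
    return float((preds == y_test).mean())

# Two hypothetical encoders evaluated under the same protocol.
scores = {}
for name, sep in [("encoder_weak", 0.2), ("encoder_strong", 1.5)]:
    X_tr, y_tr = make_embeddings(200, 64, sep)
    X_te, y_te = make_embeddings(100, 64, sep)
    scores[name] = nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te)

print(scores)
```

A real benchmark run would replace the synthetic features with embeddings extracted from actual foundation models on the 16 datasets, and would add the robustness (e.g., stain-perturbed inputs) and uncertainty modules on top of the same frozen features.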

📝 Abstract
Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation models, as well as local user-defined models, for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.
Problem

Research questions and friction points this paper is trying to address.

Assessing progress in digital pathology research
Benchmarking tile-level foundation models for reliability
Comparing feature spaces and model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tile-level benchmark for pathology foundation models
Evaluates feature spaces and prediction robustness
Supports diverse datasets and downstream tasks