HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology

📅 2025-05-17

🤖 AI Summary

Existing public whole-slide image (WSI) datasets suffer from limited scale, narrow tissue-type coverage, and insufficient clinical metadata, which constrains the generalizability and clinical applicability of AI models in computational pathology. To address these limitations, we introduce the first open-source, large-scale, multimodal WSI dataset, comprising over 60,000 WSIs that span 12 distinct tissue types. Each case is annotated with diagnostic conclusions, demographic information, fine-grained region-level labels, and DICOM-SR-compliant structured metadata. The dataset combines three pillars: large-scale acquisition, multidimensional clinical annotation, and fully open sharing. It is designed for foundation model pretraining and interpretable downstream evaluation. The public release is intended to improve the cross-tissue generalization, clinical interpretability, and real-world robustness of pathology AI models.

📝 Abstract

Recent advancements in Digital Pathology (DP), particularly through artificial intelligence and Foundation Models, have underscored the importance of large-scale, diverse, and richly annotated datasets. Despite their critical role, publicly available Whole Slide Image (WSI) datasets often lack sufficient scale, tissue diversity, and comprehensive clinical metadata, limiting the robustness and generalizability of AI models. In response, we introduce the HISTAI dataset, a large, multimodal, open-access WSI collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding. The dataset aims to fill gaps identified in existing resources, promoting innovation, reproducibility, and the development of clinically relevant computational pathology solutions. The dataset can be accessed at https://github.com/HistAI/HISTAI.
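The abstract describes each case as a set of slides paired with clinical metadata. A minimal sketch of how such a case record might be represented and filtered is shown below. The field names (`case_id`, `tissue_type`, `diagnostic_code`, and so on), the JSON layout, and the sample values are all hypothetical placeholders chosen for illustration; they are not the actual HISTAI schema, which should be checked in the repository.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseRecord:
    """Hypothetical per-case record; field names are assumptions, not the HISTAI schema."""
    case_id: str
    tissue_type: str
    diagnosis: str
    diagnostic_code: Optional[str]  # standardized code, if one is provided
    age: Optional[int]
    sex: Optional[str]
    slide_paths: List[str]          # one case may contain several WSIs


def load_cases(json_text: str) -> List[CaseRecord]:
    """Parse a JSON array of case dictionaries into CaseRecord objects."""
    return [CaseRecord(**entry) for entry in json.loads(json_text)]


def filter_by_tissue(cases: List[CaseRecord], tissue_type: str) -> List[CaseRecord]:
    """Return the cases whose tissue type matches, ignoring letter case."""
    wanted = tissue_type.lower()
    return [case for case in cases if case.tissue_type.lower() == wanted]


# Made-up placeholder records, used only to exercise the functions above.
SAMPLE = json.dumps([
    {
        "case_id": "case_0001",
        "tissue_type": "skin",
        "diagnosis": "placeholder diagnosis A",
        "diagnostic_code": None,
        "age": 54,
        "sex": "F",
        "slide_paths": ["case_0001/slide_1.tiff", "case_0001/slide_2.tiff"],
    },
    {
        "case_id": "case_0002",
        "tissue_type": "breast",
        "diagnosis": "placeholder diagnosis B",
        "diagnostic_code": None,
        "age": None,
        "sex": "F",
        "slide_paths": ["case_0002/slide_1.tiff"],
    },
])

if __name__ == "__main__":
    cases = load_cases(SAMPLE)
    for case in filter_by_tissue(cases, "Skin"):
        print(case.case_id, len(case.slide_paths))
```

Keeping metadata in a typed record separate from the slide files lets a pipeline select cases by tissue type or diagnosis before any large image is opened.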

Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale diverse WSI datasets for AI

Insufficient clinical metadata in existing pathology datasets

Need for open-access resources to enhance computational pathology

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale open-source WSI dataset

Multimodal slides with clinical metadata

Standardized diagnostic coding for pathology
