🤖 AI Summary
To address the lack of high-precision immune cell annotations in hematoxylin and eosin (H&E)-stained histopathology slides, a critical bottleneck for tumor immune microenvironment (TIME) research, this work proposes an automated, H&E–immunofluorescence (IF) registration-driven framework for constructing an immune cell database that requires minimal human intervention. The framework integrates the Segment Anything Model (SAM), dual-modality image registration, weakly supervised segmentation, and standardized cropping (64×64 pixels at 40× magnification) to achieve nucleus-level localization and automated subtyping into four classes: CD4+ T cells, CD8+ T cells, CD20+ B cells, and CD68+/CD163+ macrophages. The resulting Immunocto database contains about 6.85 million cells and objects, including 2.28 million immune cells, and is the first publicly available, million-scale H&E-based immune cell resource featuring precise nuclear masks and clinically relevant subtype labels, substantially reducing reliance on manual annotation. Deep learning models trained on Immunocto achieve state-of-the-art performance for lymphocyte detection.
📝 Abstract
With the advent of novel cancer treatment options such as immunotherapy, studying the tumour immune micro-environment (TIME) is crucial to inform prognosis and understand potential response to therapeutic agents. A key approach to characterising the TIME may be to combine (1) digitised high-resolution optical microscopy images of hematoxylin and eosin (H&E) stained tissue sections obtained in routine histopathology examinations with (2) automated immune cell detection and classification methods. In this work, we introduce a workflow to automatically generate robust single cell contours and labels from tissue sections dually stained with H&E and multiplexed immunofluorescence (IF) markers. The approach harnesses the Segment Anything Model and requires minimal human intervention compared to existing single cell databases. With this methodology, we create Immunocto, a massive, automatically generated database of 6,848,454 human cells and objects, including 2,282,818 immune cells distributed across 4 subtypes: CD4$^+$ T cell lymphocytes, CD8$^+$ T cell lymphocytes, CD20$^+$ B cell lymphocytes, and CD68$^+$/CD163$^+$ macrophages. For each cell, we provide a 64$\times$64 pixels$^2$ H&E image at 40$\times$ magnification, along with a binary mask of the nucleus and a label. The database, which is made publicly available, can be used to train models to study the TIME on routine H&E slides. We show that deep learning models trained on Immunocto result in state-of-the-art performance for lymphocyte detection. The approach demonstrates the benefits of using matched H&E and IF data to generate robust databases for computational pathology applications.
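As a rough illustration of the per-cell record described above (a 64$\times$64 H&E crop with a matching binary nucleus mask), the following Python sketch cuts a fixed-size patch around a cell centre. It is not the authors' pipeline: the function name `crop_cell`, the use of a (row, column) centre as input, and zero-padding at tile borders are all assumptions made for illustration.

```python
import numpy as np

PATCH = 64  # patch side length in pixels, as stated in the abstract


def crop_cell(image, mask, center, size=PATCH):
    """Return a size x size RGB crop and the matching boolean nucleus mask.

    image  : (H, W, 3) array holding an H&E tile (hypothetical input)
    mask   : (H, W) array, nonzero inside the nucleus (hypothetical input)
    center : (row, col) of the cell centre, assumed to lie inside the tile
    """
    half = size // 2
    r, c = center
    # Zero-pad so crops near the tile border keep the requested size.
    img_p = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    msk_p = np.pad(mask, ((half, half), (half, half)), mode="constant")
    # After padding, the original (r, c) sits at (r + half, c + half),
    # so a window starting at (r, c) is centred on the cell.
    img_crop = img_p[r:r + size, c:c + size]
    msk_crop = msk_p[r:r + size, c:c + size].astype(bool)
    return img_crop, msk_crop
```

A crop produced this way would be paired with one of the four subtype labels (or a non-immune label) to form a training example; how the authors actually store and name these fields is not specified in the abstract.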