🤖 AI Summary
This study addresses the limited generalization of mitosis detection models in real-world clinical settings by constructing a comprehensive test set of 365 cases spanning 12 human, canine, and feline tumor types across multiple scanning platforms. It introduces an integrated evaluation framework covering multi-tumor, multi-species, and multi-context (hotspot, random, and challenging regions) scenarios, along with a novel atypical mitosis classification task. In the detection task, the best-performing model achieves an F1 score of 0.740, while the highest balanced accuracy for atypical mitosis classification reaches 0.908. Model ensembling yields mean improvements of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively, whereas test-time augmentation shows no relevant benefit. The results reveal persistent performance gaps in challenging regions and across tumor types, particularly rare or highly pleomorphic ones.
📝 Abstract
Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, real-world clinical application requires models to be robust to the wide variability expected across the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine, and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge also required detection in random tissue areas (representative of whole-slide detection) and in challenging areas (rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). Eighteen teams submitted to the detection track, with F1 scores of up to 0.740. The AMF classification track received 21 submissions, with balanced accuracy values of up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging regions of interest (ROIs), where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting "blind spots" in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. We also evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that "in the wild" mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.
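
For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how an F1 score and a balanced accuracy are commonly computed from raw counts. It uses the standard textbook definitions; the function names and the example counts are hypothetical, and the challenge's actual matching rules and aggregation across cases are not described in the abstract, so this should not be read as the organizers' evaluation code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for a detection task: harmonic mean of precision and recall.

    tp, fp, fn are counts of true positives, false positives and
    false negatives (hypothetical inputs, not challenge data).
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Balanced accuracy for a binary task: mean of sensitivity and specificity.

    Useful when classes are imbalanced, e.g. when one class of
    mitotic figures is much rarer than the other.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Illustrative counts only: tripling false positives with the same
    # true positives and false negatives lowers the F1 score.
    baseline = f1_score(tp=80, fp=10, fn=20)
    tripled_fp = f1_score(tp=80, fp=30, fn=20)
    print(f"F1 baseline: {baseline:.3f}, F1 with tripled FP: {tripled_fp:.3f}")
    print(f"Balanced accuracy: {balanced_accuracy(tp=9, fn=1, tn=80, fp=20):.3f}")
```

Because F1 ignores true negatives, it suits detection, where the set of "non-mitotic" locations is not well defined. Balanced accuracy weights both classes equally, which fits a two-class task with unequal class frequencies.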