Deep Unsupervised Anomaly Detection in Brain Imaging: Large-Scale Benchmarking and Bias Analysis

📅 2025-12-01
🤖 AI Summary
Unsupervised anomaly detection in brain MRI is held back by fragmented evaluation protocols, heterogeneous datasets, and inconsistent metrics, all of which impede clinical translation. Method: The authors introduce a large-scale, multi-center benchmark trained on 2,976 T1-weighted and 2,972 T2-weighted scans of healthy individuals and tested on 2,221 T1-weighted and 1,262 T2-weighted scans, enabling systematic evaluation of reconstruction-based (including diffusion-inspired) and feature-based methods across scanners and populations. Contribution/Results: The benchmark exposes systematic biases in algorithm performance with respect to scanner, lesion type, lesion size, and demographic variables (age, sex). Across all algorithms, Dice scores for lesion segmentation range from 0.03 to 0.65. Reconstruction-based methods, particularly diffusion-inspired ones, segment lesions best, whereas feature-based approaches are more robust under distributional shift. Adding more healthy training data yields only modest gains, which suggests that algorithmic limitations rather than data scarcity are the main barrier. The benchmark provides a transparent foundation for standardized evaluation and further algorithmic work.

📝 Abstract
Deep unsupervised anomaly detection in brain magnetic resonance imaging offers a promising route to identify pathological deviations without requiring lesion-specific annotations. Yet, fragmented evaluations, heterogeneous datasets, and inconsistent metrics have hindered progress toward clinical translation. Here, we present a large-scale, multi-center benchmark of deep unsupervised anomaly detection for brain imaging. The training cohort comprised 2,976 T1-weighted and 2,972 T2-weighted scans from healthy individuals across six scanners, with ages ranging from 6 to 89 years. Validation used 92 scans to tune hyperparameters and estimate unbiased thresholds. Testing encompassed 2,221 T1w and 1,262 T2w scans spanning healthy datasets and diverse clinical cohorts. Across all algorithms, the Dice-based segmentation performance varied between 0.03 and 0.65, indicating substantial variability. To assess robustness, we systematically evaluated the impact of different scanners, lesion types and sizes, as well as demographics (age, sex). Reconstruction-based methods, particularly diffusion-inspired approaches, achieved the strongest lesion segmentation performance, while feature-based methods showed greater robustness under distributional shifts. However, the majority of algorithms showed systematic biases, including scanner-related effects, more frequent misses of small and low-contrast lesions, and false positives that varied with age and sex. Increasing healthy training data yielded only modest gains, underscoring that current unsupervised anomaly detection frameworks are limited algorithmically rather than by data availability. Our benchmark establishes a transparent foundation for future research and highlights priorities for clinical translation, including image-native pretraining, principled deviation measures, fairness-aware modeling, and robust domain adaptation.
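The Dice-based segmentation performance mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a method produces a per-voxel anomaly score, that the score is binarized with a fixed threshold, and that the result is compared with a reference lesion mask. The function names, the toy values, and the threshold of 0.5 are illustrative assumptions and are not taken from the paper, which does not describe its thresholding procedure here.

```python
from typing import List, Sequence


def threshold_map(anomaly_map: Sequence[float], threshold: float) -> List[int]:
    """Binarize a flattened anomaly map: 1 where the score exceeds the threshold."""
    return [1 if value > threshold else 0 for value in anomaly_map]


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |P and T| / (|P| + |T|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy data (hypothetical): six voxels, two of which belong to a lesion.
anomaly_map = [0.1, 0.8, 0.7, 0.2, 0.9, 0.05]
lesion_mask = [0, 1, 1, 0, 0, 0]

pred = threshold_map(anomaly_map, threshold=0.5)
score = dice_score(pred, lesion_mask)
print(f"Dice: {score:.2f}")
```

In this toy case the prediction covers both lesion voxels plus one false positive, so the overlap is 2 and the mask sizes sum to 5. Fixing the threshold on held-out healthy scans rather than on the test set is one way to keep the reported Dice from being optimistically biased.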
Problem

Research questions and friction points this paper is trying to address.

Develops a large-scale benchmark for unsupervised brain MRI anomaly detection
Evaluates algorithm robustness across scanners, lesion types, and demographics
Identifies systematic biases and limitations in current unsupervised detection frameworks
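A robustness analysis of the kind listed above can be sketched as a stratified comparison: per-scan scores are grouped by a covariate such as scanner or lesion size, and the group means are compared. The record layout, the field names, and every number below are hypothetical and serve only to show the mechanics; none of them are results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_by_group(records: List[dict], key: str, value: str) -> Dict[str, float]:
    """Average the `value` field of the records within each level of `key`."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical per-scan results (made-up numbers for illustration only).
results = [
    {"scanner": "A", "lesion_size": "small", "dice": 0.125},
    {"scanner": "B", "lesion_size": "small", "dice": 0.25},
    {"scanner": "A", "lesion_size": "large", "dice": 0.5},
    {"scanner": "B", "lesion_size": "large", "dice": 0.75},
]

by_size = mean_by_group(results, key="lesion_size", value="dice")
by_scanner = mean_by_group(results, key="scanner", value="dice")
print(by_size)
print(by_scanner)
```

A large gap between groups, for example between small and large lesions, would point to a systematic bias rather than random variation. A real analysis would also need enough scans per group and a suitable statistical test.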
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multi-center benchmark for brain anomaly detection
Reconstruction-based methods, especially diffusion-inspired ones, achieve the strongest lesion segmentation, while feature-based methods are more robust under distributional shift
Systematic biases identified, highlighting need for fairness-aware modeling
Alexander Frotscher
Department of Psychiatry and Psychotherapy, University Hospital Tübingen, Tübingen, Germany
Christian F. Baumgartner
University of Tübingen & University of Lucerne
Machine Learning, Medical Image Analysis
Thomas Wolfers
Department of Psychology, Friedrich Schiller University of Jena, Germany