AI Summary
To address the poor generalizability of automated seizure detection across patients and clinical centers, this study introduces SzCORE, the first benchmark explicitly designed for clinical generalization. SzCORE comprises 65 patients, 4,360 hours of multicenter continuous EEG recordings, and expert neurophysiologist annotations. It establishes a clinically relevant evaluation framework centered on the event-level F1-score, prioritizing clinical utility over segment- or sample-level metrics. SzCORE features a standardized, plug-and-play event-level assessment protocol enabling continuous benchmarking and validation of clinical interpretability. Of 30 submissions from 19 teams, 28 algorithms were evaluated; the top-performing model achieved an event-level F1-score of 43% (sensitivity: 37%, precision: 45%), substantially outperforming prior challenge results and commercial systems. This sets a new state-of-the-art performance benchmark for clinically deployable EEG-based seizure detection.
Abstract
Reliable automatic seizure detection from long-term EEG remains a challenge, as current machine learning models often fail to generalize across patients or clinical settings. Manual EEG review is still the clinical standard, underscoring the need for robust models and standardized evaluation. To rigorously assess algorithm performance, we organized a challenge using a private dataset of continuous EEG recordings from 65 subjects (4,360 hours). Expert neurophysiologists annotated the data, providing ground truth for seizure events. Participants were required to detect seizure onset and duration, with evaluation based on event-based metrics, including sensitivity, precision, F1-score, and false positives per day. The SzCORE framework ensured standardized evaluation. The primary ranking criterion was the event-based F1-score, which reflects clinical relevance by balancing sensitivity against false positives. The challenge received 30 submissions from 19 teams, of which 28 algorithms were evaluated. Results revealed wide variability in performance, with a top F1-score of 43% (sensitivity 37%, precision 45%), highlighting the ongoing difficulty of seizure detection. The challenge also revealed a gap between reported performance and real-world evaluation, emphasizing the importance of rigorous benchmarking. Compared to previous challenges and commercial systems, the best-performing algorithm in this challenge showed improved performance. Importantly, the challenge platform now supports continuous benchmarking, enabling reproducible research, integration of new datasets, and clinical evaluation of seizure detection algorithms using a standardized framework.
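The event-based metrics described above can be illustrated with a minimal sketch. This is not the actual SzCORE implementation: the `Event` class, the any-overlap matching criterion, and all function names here are illustrative assumptions, shown only to make the sensitivity/precision/F1/FP-per-day definitions concrete.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float  # seconds from recording start
    end: float

def overlaps(a: Event, b: Event) -> bool:
    """True if the two events share any stretch of time (assumed criterion)."""
    return a.start < b.end and b.start < a.end

def event_scores(refs: list[Event], preds: list[Event], total_hours: float):
    """Event-level sensitivity, precision, F1, and false positives per day.

    A reference seizure counts as detected if any prediction overlaps it;
    a prediction counts as correct if it overlaps any reference seizure.
    """
    tp_ref = sum(any(overlaps(r, p) for p in preds) for r in refs)
    tp_pred = sum(any(overlaps(p, r) for r in refs) for p in preds)
    false_positives = len(preds) - tp_pred

    sensitivity = tp_ref / len(refs) if refs else 0.0
    precision = tp_pred / len(preds) if preds else 0.0
    f1 = (2 * sensitivity * precision / (sensitivity + precision)
          if sensitivity + precision > 0 else 0.0)
    fp_per_day = false_positives / (total_hours / 24)
    return sensitivity, precision, f1, fp_per_day
```

For example, with two reference seizures, one overlapping detection, and one spurious detection over a 24-hour recording, the sketch yields sensitivity 0.5, precision 0.5, F1 0.5, and 1 false positive per day.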