🤖 AI Summary
Clinical evaluation of EEG foundation models suffers from inconsistent preprocessing, a lack of standardized benchmarks, and insufficient validation of generalization under real-world distribution shifts.
Method: We propose the first unified evaluation framework tailored to realistic clinical settings, covering 11 neuropsychiatric diagnostic tasks across 14 publicly available datasets. Our protocol enforces minimal preprocessing, standardized data loading and splitting, explicit simulation of cross-site distribution shifts, and multi-task performance normalization.
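The exact splitting and normalization rules are defined by the released protocol; as a rough illustration only, a leave-one-site-out split (to simulate cross-site shift) and a per-task min-max normalization could look like the sketch below. Both the function names and the min-max scheme are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def leave_one_site_out_splits(site_ids):
    """Yield (train_idx, test_idx) index arrays, holding out one
    recording site at a time to mimic a cross-site distribution shift."""
    site_ids = np.asarray(site_ids)
    for held_out in np.unique(site_ids):
        test_mask = site_ids == held_out
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def normalize_across_tasks(scores):
    """Min-max normalize each task's scores across models, then average
    per model, so no single task dominates the aggregate ranking.
    `scores` maps task name -> {model name: metric value} (assumed layout)."""
    per_model = {}
    for task, by_model in scores.items():
        vals = np.array(list(by_model.values()), dtype=float)
        lo, hi = vals.min(), vals.max()
        for model, v in by_model.items():
            norm = 0.0 if hi == lo else (v - lo) / (hi - lo)
            per_model.setdefault(model, []).append(norm)
    return {m: float(np.mean(vs)) for m, vs in per_model.items()}
```

Averaging normalized rather than raw scores keeps a task whose metric spans a wide range from dominating the multi-task aggregate.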
Contribution/Results: This work establishes the first systematic out-of-distribution (OOD) clinical evaluation paradigm for EEG foundation models. Comparative experiments spanning classical machine-learning baselines and Transformer-based architectures reveal that while current foundation models excel on specific tasks, they frequently underperform lightweight traditional models under realistic distribution shifts. To foster reproducibility and trustworthiness, we fully open-source all datasets, code, and evaluation tools, advancing standardized, clinically grounded assessment of medical AI.
📝 Abstract
We introduce a unified benchmarking framework focused on evaluating EEG-based foundation models in clinical applications. The benchmark spans 11 well-defined diagnostic tasks across 14 publicly available EEG datasets, including epilepsy, schizophrenia, Parkinson's disease, OCD, and mild traumatic brain injury. It features minimal preprocessing and standardized evaluation protocols, and enables side-by-side comparison of classical baselines and modern foundation models. Our results show that while foundation models achieve strong performance in certain settings, simpler models often remain competitive, particularly under clinical distribution shifts. To facilitate reproducibility and adoption, we release all prepared data and code in an accessible and extensible format.
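As a minimal sketch of how a lightweight classical baseline slots into such a side-by-side comparison: the released loaders provide prepared data, but their API is not reproduced here, so the arrays below are random stand-ins (an assumption); only the scikit-learn calls are real.

```python
# Hypothetical evaluation of a lightweight classical baseline.
# X_* / y_* are random stand-ins for prepared EEG features and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 64)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 64)), rng.integers(0, 2, 50)

# Standardize features, then fit a simple linear classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, clf.predict(X_test)))
```

The same held-out split would be fed to a foundation model, and both results scored with the same metric before multi-task normalization.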