AI Summary
This work identifies a pervasive data leakage issue in current cross-subject brain-to-text decoding research (using fMRI/EEG): prevailing data splitting strategies fail to enforce strict subject-level isolation, allowing test-subject information to contaminate the training set, thereby inflating performance estimates and compromising generalization assessment. To address this, we propose the first strictly subject-isolated data splitting protocol designed specifically for brain-to-text decoding, along with a unified multimodal splitting framework. Using this framework, we re-evaluate state-of-the-art BERT-based decoding models across multiple public datasets and show that their reported cross-subject generalization capabilities are systematically overestimated. Our work removes this evaluation bias, establishes a trustworthy cross-subject benchmark, and provides the field with a methodological standard and a reproducible evaluation protocol for fair and reliable model assessment.
Abstract
Recent milestone studies have successfully decoded non-invasive brain signals (e.g., functional Magnetic Resonance Imaging (fMRI) and electroencephalography (EEG)) into natural language. Despite this progress in model design, how to split datasets for training, validation, and testing remains a matter of debate. Most prior research applied subject-specific data splitting, where the decoding model is trained and evaluated per subject. This splitting method limits both how efficiently the dataset is used and how well the models generalize. In this study, we propose a cross-subject data splitting criterion for brain-to-text decoding on various types of cognitive datasets (fMRI, EEG), aiming to maximize dataset utilization and improve model generalization. We undertake a comprehensive analysis of existing cross-subject data splitting strategies and show that all of these methods suffer from data leakage, namely the leakage of test data into the training set, which leads to substantial overfitting and overestimation of decoding models. The proposed cross-subject splitting method addresses the data leakage problem, and we re-evaluate several SOTA brain-to-text decoding models as baselines for further research.
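To make the leakage problem concrete, the following is a minimal sketch of a subject-isolated split next to a naive sample-level split. It is not the protocol proposed in the paper; it only illustrates the general idea of keeping test subjects out of the training set. The sample layout (dictionaries with `subject` and `sentence` keys) and the function names `split_by_subject`, `split_by_sample`, and `leaked_subjects` are assumptions made for this illustration.

```python
import random


def split_by_subject(samples, test_fraction=0.2, seed=0):
    """Hold out whole subjects so no subject appears in both train and test."""
    subjects = sorted({s["subject"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s["subject"] not in test_subjects]
    test = [s for s in samples if s["subject"] in test_subjects]
    return train, test


def split_by_sample(samples, every=5):
    """Naive split: every `every`-th sample goes to test, ignoring subjects."""
    train = [s for i, s in enumerate(samples) if i % every != 0]
    test = [s for i, s in enumerate(samples) if i % every == 0]
    return train, test


def leaked_subjects(train, test):
    """Return the subjects that appear in both the training and the test set."""
    return {s["subject"] for s in train} & {s["subject"] for s in test}


# Hypothetical toy data: 5 subjects, 4 recorded sentences each.
samples = [
    {"subject": f"S{i}", "sentence": j} for i in range(5) for j in range(4)
]

train, test = split_by_subject(samples)
naive_train, naive_test = split_by_sample(samples)

print("subject-isolated leak:", leaked_subjects(train, test))
print("naive split leak:", leaked_subjects(naive_train, naive_test))
```

A check such as `leaked_subjects` can be run on any existing split before training. The sketch isolates subjects only; whether further isolation is required (for example, when several subjects are exposed to the same stimuli) depends on the dataset and on the protocol being followed.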