Cross-Subject Data Splitting for Brain-to-Text Decoding

Date: 2023-12-18
Citations: 1
Influential: 0
AI Summary
This work identifies a pervasive data leakage issue in current cross-subject brain-to-text decoding research (using fMRI/EEG): prevailing data splitting strategies fail to enforce strict subject-level isolation, allowing test-subject information to contaminate the training set, thereby inflating performance estimates and compromising generalization assessment. To address this, we propose the first subject-level strictly isolated data splitting protocol specifically designed for brain-to-text decoding, along with a unified multimodal splitting framework. Leveraging this framework, we rigorously re-evaluate state-of-the-art BERT-based decoding models across multiple public datasets, demonstrating that their reported cross-subject generalization capabilities are systematically overestimated. Our work eliminates evaluation bias, establishes a trustworthy cross-subject benchmark, and provides the field with a methodological standard and reproducible evaluation protocol for fair and reliable model assessment.
Abstract
Recent work has reached major milestones in decoding non-invasive brain signals, such as functional magnetic resonance imaging (fMRI) and electroencephalography (EEG), into natural language. Despite progress in model design, how to split datasets for training, validation, and testing remains a matter of debate. Most prior research applied subject-specific data splitting, where the decoding model is trained and evaluated per subject. This splitting method limits both the efficiency of dataset utilization and the generalization of models. In this study, we propose a cross-subject data splitting criterion for brain-to-text decoding on various types of cognitive datasets (fMRI, EEG), aiming to maximize dataset utilization and improve model generalization. We undertake a comprehensive analysis of existing cross-subject data splitting strategies and prove that all of these methods suffer from data leakage, namely the leakage of test data into the training set, which leads to significant overfitting and overestimation of decoding models. The proposed cross-subject splitting method addresses the data leakage problem, and we re-evaluate several SOTA brain-to-text decoding models as baselines for further research.
Problem

Research questions and friction points this paper is trying to address.

Addressing data leakage in cross-subject brain-to-text decoding
Developing correct data splitting for fMRI and EEG signals
Re-evaluating SOTA models with proper validation criteria
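
To make the leakage problem concrete, here is a minimal Python sketch of how a sample-level shuffle can place the same subject on both sides of a split, and how such overlap might be detected. It is an illustration only, not the paper's analysis: the record layout (dicts with a `subject` field) and the helper names `random_sample_split` and `shared_values` are assumptions introduced here.

```python
import random


def random_sample_split(samples, test_fraction=0.2, seed=0):
    """Shuffle all samples and split them, ignoring which subject each came from."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled samples form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


def shared_values(train, test, key):
    """Return the values of `key` that occur in both the training and test sets."""
    return {s[key] for s in train} & {s[key] for s in test}


if __name__ == "__main__":
    # Hypothetical toy data: 4 subjects, 10 recordings each.
    samples = [
        {"subject": f"S{i}", "recording": j}
        for i in range(4)
        for j in range(10)
    ]
    train, test = random_sample_split(samples)
    leaked = shared_values(train, test, "subject")
    print(f"{len(leaked)} subject(s) appear in both train and test")
```

A non-empty result from `shared_values` means the test set is not isolated from training with respect to that field, so a score measured on it may overstate cross-subject generalization.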
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed cross-subject data splitting criterion
Prevented data leakage in brain signals
Re-evaluated SOTA decoding models correctly
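
As a rough illustration of subject-level isolation, the sketch below assigns whole subjects to the training, validation, or test set so that no subject contributes data to more than one of them. This summary does not spell out the paper's exact criterion, so treat this as one plausible reading rather than the authors' protocol; the `subject` field and the function name `split_by_subject` are hypothetical.

```python
import random


def split_by_subject(samples, n_val_subjects=1, n_test_subjects=1, seed=0):
    """Assign whole subjects to train/val/test so the three sets share no subject."""
    subjects = sorted({s["subject"] for s in samples})
    if n_val_subjects + n_test_subjects >= len(subjects):
        raise ValueError("not enough subjects to leave any for training")

    # Shuffle subject IDs (not samples) with a fixed seed for reproducibility.
    rng = random.Random(seed)
    rng.shuffle(subjects)
    test_ids = set(subjects[:n_test_subjects])
    val_ids = set(subjects[n_test_subjects:n_test_subjects + n_val_subjects])

    splits = {"train": [], "val": [], "test": []}
    for sample in samples:
        if sample["subject"] in test_ids:
            splits["test"].append(sample)
        elif sample["subject"] in val_ids:
            splits["val"].append(sample)
        else:
            splits["train"].append(sample)
    return splits


if __name__ == "__main__":
    # Hypothetical toy data: 5 subjects, 3 recordings each.
    samples = [
        {"subject": f"S{i}", "recording": j}
        for i in range(5)
        for j in range(3)
    ]
    splits = split_by_subject(samples)
    for name, part in splits.items():
        print(name, sorted({s["subject"] for s in part}))
```

Shuffling subject identifiers rather than individual samples is what keeps the sets disjoint at the subject level; the trade-off is that split sizes follow the number of recordings per subject instead of an exact fraction.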
Congchi Yin
Nanjing University of Aeronautics and Astronautics, China
Qian Yu
Professor, Dept of Earth, Geographic, and Climate Sciences, University of Massachusetts-Amherst
GIS, remote sensing, spatial modeling
Zhiwei Fang
JD.com, Beijing, China
Jie He
JD.com, Beijing, China
Changping Peng
JD.com, Beijing, China
Zhangang Lin
JD.com, Beijing, China
Jingping Shao
JD.com, Beijing, China
Piji Li
Nanjing University of Aeronautics and Astronautics, China