🤖 AI Summary
This study addresses metadata privacy risks in neuroimaging data sharing by systematically evaluating re-identification vulnerabilities in publicly available BIDS-formatted datasets. We developed and applied metaprivBIDS, a novel tool for automated, standardized privacy auditing of tabular metadata (e.g., demographics, clinical scores), integrating statistical and semantic analyses to detect cross-population differences in de-identification efficacy. Results indicate low re-identification risk for clinical scores, whereas demographic variables (age, sex, race, income, and geolocation) constitute the primary privacy bottleneck. While most datasets exhibit no critical vulnerabilities, mild information leakage is widespread and remains exploitable. Based on these findings, we propose a tiered mitigation strategy. This work establishes a reproducible, scalable privacy-assessment framework for neuroscientific data governance, grounded in empirical evidence and aligned with FAIR and GDPR principles.
📝 Abstract
The ethical and legal imperative to share research data without causing harm requires careful attention to privacy risks. While mounting evidence demonstrates that data sharing benefits science, legitimate concerns persist regarding the potential leakage of personal information that could lead to re-identification and subsequent harm. We reviewed metadata accompanying neuroimaging datasets from six heterogeneous studies openly available on OpenNeuro, involving participants across the lifespan, from children to older adults, with and without clinical diagnoses, and including associated clinical score data. Using metaprivBIDS (https://github.com/CPernet/metaprivBIDS), a novel tool for the systematic assessment of privacy in tabular data, we found that privacy is generally well maintained, with serious vulnerabilities being rare. Nonetheless, minor issues were identified in nearly all datasets and warrant mitigation. Notably, clinical score data (e.g., neuropsychological results) posed minimal re-identification risk, whereas demographic variables (age, sex, race, income, and geolocation) represented the principal privacy vulnerabilities. We outline practical measures to address these risks, enabling safer data sharing practices.
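To make the notion of demographic re-identification risk concrete, the sketch below computes k-anonymity over a set of quasi-identifiers in participant metadata. This is an illustrative toy example, not metaprivBIDS's actual algorithm; the function name, the example rows, and the choice of quasi-identifiers are all hypothetical. A group size of k = 1 means at least one participant is uniquely identified by that attribute combination.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest equivalence-class size over the given
    quasi-identifiers; k == 1 means some participant is unique."""
    classes = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(classes.values())

# Hypothetical rows mimicking a BIDS participants.tsv (illustrative only)
rows = [
    {"age": 24, "sex": "F"},
    {"age": 24, "sex": "F"},
    {"age": 24, "sex": "M"},
    {"age": 67, "sex": "M"},
]

print(k_anonymity(rows, ["sex"]))         # 2: sex alone groups participants in pairs
print(k_anonymity(rows, ["age", "sex"]))  # 1: combining attributes isolates someone
```

The example mirrors the abstract's finding: a single coarse demographic field may be harmless, but combinations of demographic variables are where uniqueness, and hence re-identification risk, emerges.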