🤖 AI Summary
To support privacy-compliant sharing of sensitive medical data, this paper proposes a data-driven method that automatically learns and generates Synthea rules for synthetic patients from real-world cancer registry data, using glioblastoma as a case study. The method is an end-to-end automated pipeline that combines structured data preprocessing, conditional probabilistic modeling of disease progression pathways, and mapping and compilation of the resulting rules into Synthea modules, which enables, for the first time, fully automated rule generation without clinical expert intervention. The key contribution is the shift from knowledge-intensive, clinician-guided modeling to purely data-driven inference of disease evolution logic. In the experimental evaluation, the generated synthetic data achieves <5% error across critical statistical metrics, including age distribution, diagnostic timing, and comorbidity patterns, and thereby reproduces real-world clinical characteristics. The work establishes a scalable, reproducible, and privacy-preserving framework for synthetic health data generation.
📝 Abstract
The generation of synthetic data is a promising way to make medical data available for secondary use in a privacy-compliant manner. A popular method for creating realistic patient data is the rule-based Synthea data generator. Synthea generates data based on rules describing the lifetime of a synthetic patient. These rules typically express the probability of a condition occurring, such as a disease, depending on factors like age. Because the rules contain only statistical information, they usually carry no specific data protection requirements. However, creating meaningful rules can be a very complex process that requires expert knowledge and realistic sample data. In this paper, we introduce and evaluate an approach to automatically generate Synthea rules based on statistics from tabular data, which we extracted from cancer reports. As an example use case, we created a Synthea module for glioblastoma from a real-world dataset and used it to generate a synthetic dataset. Compared to the original dataset, the synthetic data reproduced known disease courses and mostly retained the statistical properties. Overall, synthetic patient data holds great potential for privacy-preserving research. The data can be used to formulate hypotheses and to develop prototypes, but medical interpretation should take the specific limitations into account, as with any currently available approach.
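To make the idea of a rule that expresses the probability of a condition depending on age more concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical input of `(age, has_condition)` pairs, a fixed 20-year age band, and a placeholder condition code. The JSON field names (`complex_transition`, `distributions`, `condition_type`) follow our understanding of Synthea's Generic Module Framework and should be checked against the Synthea documentation before use.

```python
import json
from collections import Counter


def age_band(age, width=20):
    """Return the (lower, upper) bounds of the age band containing `age`."""
    lower = (age // width) * width
    return lower, lower + width


def onset_probabilities(records, width=20):
    """Estimate P(condition | age band) from (age, has_condition) pairs.

    `records` is a hypothetical tabular extract; the real dataset layout
    is not described in the abstract.
    """
    totals, cases = Counter(), Counter()
    for age, has_condition in records:
        band = age_band(age, width)
        totals[band] += 1
        if has_condition:
            cases[band] += 1
    return {band: cases[band] / totals[band] for band in sorted(totals)}


def build_module(name, probabilities):
    """Map per-band probabilities to a Synthea-style module dictionary."""
    transitions = []
    for (lower, upper), p in probabilities.items():
        transitions.append({
            "condition": {
                "condition_type": "And",
                "conditions": [
                    {"condition_type": "Age", "operator": ">=",
                     "quantity": lower, "unit": "years"},
                    {"condition_type": "Age", "operator": "<",
                     "quantity": upper, "unit": "years"},
                ],
            },
            "distributions": [
                {"distribution": p, "transition": "Onset"},
                {"distribution": 1 - p, "transition": "Terminal"},
            ],
        })
    # Fallback for ages not covered by any observed band (assumed behavior).
    transitions.append({"transition": "Terminal"})
    return {
        "name": name,
        "states": {
            "Initial": {"type": "Initial", "complex_transition": transitions},
            "Onset": {
                "type": "ConditionOnset",
                # Placeholder code: a real module needs a valid terminology code.
                "codes": [{"system": "SNOMED-CT", "code": "PLACEHOLDER",
                           "display": name}],
                "direct_transition": "Terminal",
            },
            "Terminal": {"type": "Terminal"},
        },
    }


# Toy records, invented for illustration only.
records = [(45, True), (50, False), (55, False), (58, False),
           (65, True), (70, True), (72, False), (75, False)]
probs = onset_probabilities(records)
module = build_module("example_condition", probs)
print(json.dumps(module, indent=2))
```

Because the module stores only aggregated probabilities per age band, no individual record from the input appears in the output, which is the property the abstract relies on when it notes that rules usually carry no specific data protection requirements. A realistic module would also need encounter, delay, and treatment states, which this sketch omits.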