🤖 AI Summary
Existing synthetic electronic health record (EHR) models are difficult to compare fairly and to reproduce because of fragmented codebases, incompatible data loading pipelines, and inconsistent evaluation protocols. To address this, the work introduces a lightweight, end-to-end benchmarking framework built on PyHealth that integrates prominent models (MedGAN, CorGAN, PromptEHR, HALO, and a compact GPT-2) within a unified pipeline, enabling standardized training and architecture-agnostic evaluation across the full ICD-9 vocabulary. Generative adversarial network (GAN)-based and Transformer-based models are compared systematically under a single protocol. The study also contributes a joint privacy-utility evaluation suite that reports bootstrapped confidence intervals, and it analyzes the weak performance of current methods on long-tailed diagnostic distributions. The framework lowers the engineering barrier for synthetic EHR research and offers a starting point for reproducible, standardized benchmarking in the field.
📝 Abstract
The generation of high-fidelity synthetic Electronic Health Records (EHR) is crucial for advancing medical research while preserving patient privacy. However, head-to-head comparison of existing generative models is hindered by disjointed codebases, incompatible data loaders, conflicting library dependencies, and inconsistent evaluation protocols. To address these gaps, we introduce a lightweight, end-to-end benchmarking framework for reproducible synthetic EHR evaluation, organized as a unified pipeline spanning data ingestion, standardized model training, and architecture-agnostic evaluation. Our current implementation targets the generation of longitudinal ICD diagnosis codes, the most commonly studied modality in this literature, and is built on the community-maintained PyHealth library. We reimplement and unify strong baselines (MedGAN, CorGAN, PromptEHR, HALO) at full ICD-9 vocabulary granularity, and add a lightweight GPT-2 baseline from the general-purpose sequence-modeling literature. We contribute a rigorous, architecture-agnostic privacy-utility evaluation suite that applies identically to GAN- and transformer-based generators, and we report bootstrapped confidence intervals for all metrics. We further analyze the poor performance of existing models on long-tailed code distributions and discuss the extensibility of our framework beyond diagnosis codes. By lowering the engineering barrier to running, extending, and evaluating models under a single pipeline, we provide a starting point for community-driven reproducibility and benchmarking of synthetic EHR models.
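The abstract mentions bootstrapped confidence intervals but does not say how they are computed. As a rough illustration only, the sketch below shows a generic percentile bootstrap over per-record metric values. The function name `bootstrap_ci`, its parameters, and the example scores are all hypothetical and are not taken from the paper or from PyHealth; the authors' actual procedure may differ.

```python
import random
from statistics import mean


def bootstrap_ci(values, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values).

    Hypothetical helper for illustration; not the paper's implementation.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)

    # Resample the data with replacement and recompute the statistic each time.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )

    # Read the interval off the empirical quantiles of the resampled statistics,
    # clamping the indices so they always stay inside the list.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return stat(values), estimates[lo_idx], estimates[hi_idx]


if __name__ == "__main__":
    # Made-up per-record scores, standing in for any metric in the suite.
    example_scores = [0.62, 0.71, 0.58, 0.66, 0.74, 0.69, 0.60, 0.65]
    point, lower, upper = bootstrap_ci(example_scores)
    print(f"mean={point:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

Because the resampling only needs a list of per-record metric values, the same routine would apply unchanged to outputs from GAN-based and transformer-based generators, which is in the spirit of the architecture-agnostic evaluation described above.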