PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research

📅 2025-09-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of collaborative analysis across multi-institutional electronic health record (EHR) systems, hindered by data heterogeneity, semantic inconsistency, and stringent privacy constraints. We propose a unified two-module framework that enables privacy-preserving EHR harmonization across institutions and disparate data models—without sharing raw individual-level data. The framework integrates standardized clinical coding mapping with machine learning–driven representation learning. Accompanied by open-source software and step-by-step implementation tutorials, it supports end-to-end translational research. Empirical validation across multiple real-world healthcare systems demonstrates substantial improvements in data interoperability and reusability, enabling the construction of high-quality, research-ready EHR datasets. The approach exhibits strong generalizability, scalability, and clinical deployability.

📝 Abstract

Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major challenges due to data heterogeneity, semantic differences, and privacy concerns. To address these challenges, we introduce $\textit{PEHRT}$, a standardized pipeline for efficient EHR data harmonization consisting of two core modules: (1) data pre-processing and (2) representation learning. PEHRT maps EHR data to standard coding systems and uses advanced machine learning to generate research-ready datasets without requiring individual-level data sharing. Our pipeline is also data model agnostic and designed for streamlined execution across institutions based on our extensive real-world experience. We provide a complete suite of open source software, accompanied by a user-friendly tutorial, and demonstrate the utility of PEHRT in a variety of tasks using data from diverse healthcare systems.
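
To make the two-module idea concrete, the following is a minimal sketch and not the PEHRT software itself. It assumes a hypothetical local-to-standard code map (`LOCAL_TO_STANDARD`), made-up code labels, and a simple patient-level co-occurrence count as the summary statistic that a site might share in place of individual-level records. The abstract does not specify which summary statistics or learning methods PEHRT uses, so every name and design choice here is illustrative only.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical mapping from site-specific codes to a standard vocabulary.
# All code labels are invented for illustration.
LOCAL_TO_STANDARD = {
    "SITE_A_DX_01": "STD:T2D",
    "SITE_A_DX_02": "STD:T2D",
    "SITE_A_LAB_07": "STD:GLUCOSE",
}


def harmonize(records, code_map):
    """Map (patient_id, local_code) pairs to standard codes; drop unmapped codes."""
    return [(pid, code_map[code]) for pid, code in records if code in code_map]


def cooccurrence_counts(harmonized):
    """Count the patients in whom each pair of standard codes co-occurs.

    Only these aggregate counts would leave the site (an assumption of this sketch).
    """
    codes_by_patient = defaultdict(set)
    for pid, code in harmonized:
        codes_by_patient[pid].add(code)

    counts = Counter()
    for codes in codes_by_patient.values():
        # Sort so each unordered pair has one canonical key.
        for pair in combinations(sorted(codes), 2):
            counts[pair] += 1
    return counts


# Toy individual-level records that stay inside the institution.
records = [
    ("p1", "SITE_A_DX_01"), ("p1", "SITE_A_LAB_07"),
    ("p2", "SITE_A_DX_02"), ("p2", "SITE_A_LAB_07"),
    ("p3", "SITE_A_DX_02"), ("p3", "SITE_A_UNMAPPED"),
]

summary = cooccurrence_counts(harmonize(records, LOCAL_TO_STANDARD))
print(summary)
```

In this sketch, two local diagnosis codes collapse to one standard code before counting, and the resulting pair counts contain no patient identifiers. A downstream representation-learning step could then operate on such aggregates pooled across sites.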

Problem

Research questions and friction points this paper is trying to address.

Harmonizing multi-institutional EHR data for translational research

Addressing data heterogeneity and semantic differences across institutions

Enabling research without individual-level data sharing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized pipeline for EHR harmonization

Machine learning generates research-ready datasets

Data model agnostic with open source software
