GENIE: Generative Note Information Extraction model for structuring EHR data

📅 2025-01-30
🤖 AI Summary
Problem: Existing EHR text structuring methods suffer from low efficiency and poor generalizability, while very large language models (LLMs) remain impractical for large-scale clinical use because they are slow and costly. Method: The authors propose GENIE, an end-to-end generative information extraction system for clinical notes that identifies entities, assertion statuses, anatomical locations, modifiers, numeric values, and clinical purposes in a single inference pass. The approach combines a robust data preparation pipeline with fine-tuned small open-weight LLMs in a unified generative paradigm, replacing rule-based systems and multi-stage pipelines. Contribution/Results: The system can be extended to extract additional attributes and outperforms traditional tools (e.g., cTAKES, MetaMap) across multiple information extraction tasks. It pairs high accuracy with lightweight deployment, which makes it better suited to real-world healthcare systems. The model and test data are released as open source.

📝 Abstract
Electronic Health Records (EHRs) hold immense potential for advancing healthcare, offering rich, longitudinal data that combines structured information with valuable insights from unstructured clinical notes. However, the unstructured nature of clinical text poses significant challenges for secondary applications. Traditional methods for structuring EHR free-text data, such as rule-based systems and multi-stage pipelines, are often limited by time-consuming configuration and an inability to adapt to clinical notes from diverse healthcare settings. Few systems provide comprehensive attribute extraction for the terms they identify. While very large language models (LLMs) like GPT-4 and LLaMA 405B excel at structuring tasks, they are slow, costly, and impractical for large-scale use. To overcome these limitations, we introduce GENIE, a Generative Note Information Extraction system that leverages LLMs to convert unstructured clinical text into usable data in a standardized format. GENIE processes entire paragraphs in a single pass, extracting entities, assertion statuses, locations, modifiers, values, and purposes with high accuracy. Its unified, end-to-end approach simplifies workflows, reduces errors, and eliminates the need for extensive manual intervention. Using a robust data preparation pipeline and fine-tuned small-scale LLMs, GENIE achieves competitive performance across multiple information extraction tasks, outperforms traditional tools like cTAKES and MetaMap, and can be extended to extract additional attributes. GENIE thereby improves real-world applicability and scalability in healthcare systems. By open-sourcing the model and test data, we aim to encourage collaboration and drive further advancements in EHR structuring.
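To make the idea of single-pass structured extraction concrete, the following is a minimal sketch of how the six attribute types named in the abstract could be held in one record per entity. Everything here is an assumption for illustration: the abstract does not specify GENIE's actual output format, so the JSON layout, the field names, the assertion labels, the `ClinicalEntity` class, and the `parse_model_output` helper are all hypothetical, and the sample output is hand-written rather than produced by the model.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClinicalEntity:
    """One extracted mention carrying the attribute types listed in the abstract."""
    entity: str
    assertion: Optional[str] = None   # assumed labels such as "present" / "absent"
    location: Optional[str] = None
    modifiers: List[str] = field(default_factory=list)
    value: Optional[str] = None
    purpose: Optional[str] = None


def parse_model_output(raw: str) -> List[ClinicalEntity]:
    """Parse a JSON array of objects (an assumed format) into typed records."""
    records = []
    for item in json.loads(raw):
        records.append(
            ClinicalEntity(
                entity=item["entity"],
                assertion=item.get("assertion"),
                location=item.get("location"),
                modifiers=list(item.get("modifiers") or []),
                value=item.get("value"),
                purpose=item.get("purpose"),
            )
        )
    return records


# Hand-written stand-in for generated output on the paragraph
# "No chest pain. Metoprolol 25 mg for hypertension."
example_output = """
[
  {"entity": "chest pain", "assertion": "absent", "location": "chest"},
  {"entity": "metoprolol", "assertion": "present",
   "value": "25 mg", "purpose": "hypertension"}
]
"""

entities = parse_model_output(example_output)
for e in entities:
    print(e)
```

Keeping every attribute on the same record reflects the unified, end-to-end framing in the abstract: one generation step yields the entity together with its attributes, so no separate relation-linking stage is needed downstream.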
Problem

Research questions and friction points this paper is trying to address.

Electronic Health Records
Language Models
Information Extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

GENIE Model
Automated Information Extraction
Medical Data Standardization
Authors

Huaiyuan Ying
Department of Statistics and Data Science, Tsinghua University, Beijing, China

Hongyi Yuan
Tsinghua University, Harvard University

Jinsen Lu
Department of Statistics and Data Science, Tsinghua University, Beijing, China

Zitian Qu
Zhili College, Tsinghua University, Beijing, China

Yang Zhao
Weiyang College, Tsinghua University, Beijing, China

Zhengyun Zhao
Tsinghua University
Large Language Model, Information Retrieval, Medical AI

Isaac Kohane
Harvard Medical School, Children's Hospital, Brigham and Women's Hospital
Bioinformatics, Artificial Intelligence, Autism, Electronic Health Records, Functional Genomics

Tianxi Cai
Harvard University
statistics, biostatistics, modeling, prediction, genomics

Sheng Yu
Department of Statistics and Data Science, Tsinghua University, Beijing, China