🤖 AI Summary
EHR retrieval suffers from a semantic gap, and existing dense retrieval models are hindered by insufficient biomedical knowledge or domain-mismatched training data. To address this, we propose DR.EHR—the first dense retrieval model specifically designed for EHRs—employing a two-stage training paradigm that integrates biomedical knowledge graph entity injection and large language model–driven synthetic data augmentation. This design significantly enhances clinical semantic matching capability. DR.EHR comprises two variants: 110M- and 7B-parameter models. On the CliniQ benchmark, it achieves state-of-the-art performance across all metrics, outperforming prior methods. Moreover, it demonstrates strong generalization on diverse clinical query types and EHR-based question answering tasks. Our core contributions lie in (i) knowledge-guided synthetic data construction and (ii) a domain-adaptive, two-stage training framework tailored for EHR retrieval.
📝 Abstract
Electronic Health Records (EHRs) are pivotal in clinical practice, yet their retrieval remains a challenge, mainly due to semantic gap issues. Recent advancements in dense retrieval offer promising solutions, but existing models, both general-domain and biomedical-domain, fall short due to insufficient medical knowledge or mismatched training corpora. This paper introduces DR.EHR, a series of dense retrieval models specifically tailored for EHR retrieval. We propose a two-stage training pipeline utilizing MIMIC-IV discharge summaries to address the need for extensive medical knowledge and large-scale training data. The first stage involves medical entity extraction and knowledge injection from a biomedical knowledge graph, while the second stage employs large language models to generate diverse training data. We train two variants of DR.EHR, with 110M and 7B parameters, respectively. Evaluated on the CliniQ benchmark, our models significantly outperform all existing dense retrievers, achieving state-of-the-art results. Detailed analyses confirm our models' superiority across various match and query types, particularly in challenging semantic matches like implication and abbreviation. Ablation studies validate the effectiveness of each pipeline component, and supplementary experiments on EHR QA datasets demonstrate the models' generalizability to natural language questions, including complex ones with multiple entities. This work significantly advances EHR retrieval, offering a robust solution for clinical applications.
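To make the training setup concrete: dense retrievers of this kind are typically optimized with an in-batch-negative contrastive (InfoNCE) objective, where each query's paired passage is the positive and the other passages in the batch serve as negatives. The sketch below is illustrative only, not the authors' implementation; the temperature value and batch layout are assumptions.

```python
# Illustrative sketch of in-batch-negative contrastive (InfoNCE) loss,
# the standard objective for training dense retrievers. NOT the DR.EHR
# code; embedding dimensions and temperature here are arbitrary choices.
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """Each query's positive is the document at the same batch index;
    all other documents in the batch act as negatives."""
    # L2-normalize so the dot product is cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = (q @ d.T) / temperature          # (batch, batch) similarity matrix
    # cross-entropy with the diagonal entries as the correct classes
    logits = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))                # 4 query embeddings, dim 8
loss_matched = info_nce_loss(q, q.copy())  # perfectly aligned pairs -> low loss
```

In the paper's pipeline, the synthetic query-passage pairs from the knowledge-injection and LLM-generation stages would supply the (query, positive document) pairs consumed by such an objective.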