🤖 AI Summary
Existing EHR entity retrieval research lacks a publicly available, semantics-aware benchmark to systematically address the semantic gap between clinical queries and structured EHR documents.
Method: We introduce EHR-ERB, the first standardized benchmark for EHR entity retrieval, constructed from MIMIC-III discharge summaries, ICD codes, and prescription labels, yielding 1,246 queries over 1,000 documents. We define five fine-grained semantic matching categories (exact string, synonymy, abbreviation, hypernymy/hyponymy, and entailment) and employ GPT-4 to generate over 77,000 relevance annotations.
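The five matching categories can be sketched as a small taxonomy. This is an illustrative sketch: the clinical example pairs in the comments are assumptions chosen to show the intent of each category, not pairs drawn from the benchmark itself.

```python
from enum import Enum

class MatchType(Enum):
    """Semantic matching categories between a query entity and note text.

    Each comment gives a hypothetical query/note pair of that type.
    """
    STRING = "exact string"         # "aspirin"      vs "aspirin"
    SYNONYM = "synonymy"            # "heart attack" vs "myocardial infarction"
    ABBREVIATION = "abbreviation"   # "MI"           vs "myocardial infarction"
    HYPONYM = "hypernymy/hyponymy"  # "beta blocker" vs "metoprolol"
    IMPLICATION = "entailment"      # "on insulin" entails a diabetes diagnosis
```

Only the STRING category is reachable by pure lexical matching; the other four require some form of semantic generalization, which is why they separate BM25 from dense retrievers in the evaluation.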
Contribution/Results: Extensive evaluation of BM25, query expansion, and general/biomedical dense retrievers (ColBERT, ANCE) reveals: (1) BM25 excels at lexical matching but fails on semantic generalization; (2) query expansion improves semantic recall at the cost of precision; and (3) general-domain models outperform domain-specialized ones on semantic matching. This work provides the first systematic quantification of the EHR semantic gap and establishes a reproducible, attributable foundation for future entity retrieval research.
📄 Abstract
Entity retrieval plays a crucial role in the utilization of Electronic Health Records (EHRs) and is applied across a wide range of clinical practices. However, a comprehensive evaluation of this task is lacking due to the absence of a public benchmark. In this paper, we propose the development and release of a novel benchmark for evaluating entity retrieval in EHRs, with a particular focus on the semantic gap issue. Using discharge summaries from the MIMIC-III dataset, we incorporate ICD codes and prescription labels associated with the notes as queries, and annotate relevance judgments using GPT-4. In total, we use 1,000 patient notes, generate 1,246 queries, and provide over 77,000 relevance annotations. To offer the first assessment of the semantic gap, we introduce a novel classification system for relevance matches. Leveraging GPT-4, we categorize each relevant pair into one of five categories: string, synonym, abbreviation, hyponym, and implication. Using the proposed benchmark, we evaluate several retrieval methods, including BM25, query expansion, and state-of-the-art dense retrievers. Our findings show that BM25 provides a strong baseline but struggles with semantic matches. Query expansion significantly improves performance, though it slightly reduces string match capabilities. Dense retrievers outperform traditional methods, particularly for semantic matches, and general-domain dense retrievers often surpass those trained specifically in the biomedical domain.
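The BM25 baseline behavior described above can be demonstrated concretely. The following is a minimal sketch, assuming a tiny hypothetical corpus of note snippets in place of real MIMIC-III discharge summaries, with Okapi BM25 implemented directly:

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for MIMIC-III discharge summaries
# (illustrative snippets only, not real MIMIC-III data).
docs = [
    "patient admitted with acute myocardial infarction treated with aspirin",
    "history of type 2 diabetes mellitus managed with metformin",
    "chest pain ruled out no evidence of heart attack on troponin",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N
df = Counter(t for d in tokenized for t in set(d))  # document frequencies

def bm25(query, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for a whitespace-tokenized query."""
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.split():
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# A lexical query ranks the matching note first...
scores = bm25("myocardial infarction")
ranked = sorted(range(N), key=scores.__getitem__, reverse=True)
# ...but the synonymous query "heart attack" scores that same note 0.0:
# the semantic gap that string-based retrieval cannot cross.
```

Because BM25 only counts shared terms, the synonym pair "heart attack" / "myocardial infarction" gets no credit at all, which is the failure mode that query expansion and dense retrievers are evaluated against.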