🤖 AI Summary
This study addresses the time-consuming, error-prone, and hard-to-scale nature of manual electronic health record (EHR) review for staging cognitive impairment. We evaluate a zero-shot, automated approach that uses GPT-4o, without fine-tuning or annotated training data, to interpret unstructured clinical notes from two cohorts: specialist notes from the Massachusetts General Hospital (MGH) memory clinic, and all notes in a 3-year window for Medicare patients. The model maps the notes to a global Clinical Dementia Rating (CDR) score in the first cohort and to a three-class diagnosis (normal cognition, mild cognitive impairment (MCI), or dementia) in the second. Our key contribution is a validation of GPT-4o's agreement with clinical experts under zero-shot conditions: weighted kappa = 0.83 for CDR staging on specialist notes from 769 memory clinic patients, and weighted kappa = 0.91 for three-class classification across 860 Medicare patients (vs. specialist chart review), rising to 0.96 on the cases that adjudicators rated with high confidence. Because it needs no labeled training data, the approach avoids the annotation dependency of conventional supervised NLP and points to a scalable way to build research datasets and, in the future, to assist diagnosis in clinical settings.
📝 Abstract
Identifying cognitive impairment within electronic health records (EHRs) is crucial not only for timely diagnoses but also for facilitating research. Information about cognitive impairment often exists within unstructured clinician notes in EHRs, but manual chart reviews are both time-consuming and error-prone. To address this issue, our study evaluates an automated approach using zero-shot GPT-4o to determine the stage of cognitive impairment in two different tasks. First, we evaluated the ability of GPT-4o to determine the global Clinical Dementia Rating (CDR) on specialist notes from 769 patients who visited the memory clinic at Massachusetts General Hospital (MGH), where it achieved a weighted kappa score of 0.83. Second, we assessed GPT-4o's ability to differentiate between normal cognition, mild cognitive impairment (MCI), and dementia on all notes in a 3-year window from 860 Medicare patients. GPT-4o attained a weighted kappa score of 0.91 in comparison to specialist chart reviews and 0.96 on cases that the clinical adjudicators rated with high confidence. Our findings demonstrate GPT-4o's potential as a scalable chart review tool for creating research datasets and, in the future, for assisting diagnosis in clinical settings.
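For readers unfamiliar with the agreement metric reported above, the following is a minimal sketch of Cohen's weighted kappa for two raters over ordered categories. It is illustrative only and is not the authors' evaluation code: the function name `weighted_kappa`, the choice between linear and quadratic weights (the abstract does not say which was used), and the toy labels are all assumptions made for this example.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weighting="linear"):
    """Cohen's weighted kappa for two raters over ordered categories.

    `categories` lists the labels from least to most severe. The weighting
    scheme is an assumption of this sketch; the abstract does not specify it.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("need at least two ordered categories")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")

    index = {label: i for i, label in enumerate(categories)}
    k = len(categories)
    n = len(rater_a)
    power = 1 if weighting == "linear" else 2

    # Joint counts of (rater A label, rater B label) and each rater's marginals.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected if raters were independent
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            observed += weight * joint[(i, j)] / n
            expected += weight * marg_a[i] * marg_b[j] / (n * n)

    if expected == 0:
        # Both raters used one and the same category throughout;
        # kappa is undefined, and this sketch reports perfect agreement.
        return 1.0
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy labels invented for illustration; not data from the study.
    stages = ["normal", "MCI", "dementia"]
    expert = ["normal", "MCI", "dementia", "normal"]
    model = ["normal", "MCI", "dementia", "MCI"]
    print(weighted_kappa(expert, model, stages, weighting="linear"))
    print(weighted_kappa(expert, model, stages, weighting="quadratic"))
```

Weighted kappa penalizes a disagreement in proportion to how far apart the two ratings are on the ordinal scale, so confusing normal cognition with MCI costs less than confusing normal cognition with dementia. That property is why it suits staged outcomes such as the CDR.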