🤖 AI Summary
This study investigates large language models’ (LLMs) capacity to comprehend and reason with clinical measurement data—specifically blood pressure (BP). To this end, we introduce BPQA, the first BP-focused medical question-answering benchmark, comprising 100 physician-validated question-answer pairs. We conduct systematic evaluations across BERT, BioBERT, MedAlpaca, and GPT-3.5. Our results demonstrate, for the first time quantitatively, that mainstream LLMs can effectively integrate BP values into clinical reasoning: GPT-3.5 and MedAlpaca achieve notably superior performance. Moreover, structured (i.e., normalized and annotated) BP representations improve accuracy by up to 12.3% for BioBERT and MedAlpaca. These gains further suggest that retrieval-style augmentation may help domain-specific models better leverage numerical clinical measurements. This work establishes both a novel benchmark and a methodological foundation for measurement-driven medical QA, advancing the integration of quantitative physiological data into LLM-based clinical decision support.
📝 Abstract
Clinical measurements such as blood pressure and respiration rate are critical for diagnosing and monitoring patient outcomes. Such measurements are an important component of biomedical data and can be used to train transformer-based language models (LMs) for improving healthcare delivery. It is, however, unclear whether LMs can effectively interpret and use clinical measurements. We investigate two questions: First, can LMs effectively leverage clinical measurements to answer related medical questions? Second, how can we enhance an LM's performance on medical question-answering (QA) tasks that involve measurements? We performed a case study on blood pressure readings (BPs), a vital sign routinely monitored by medical professionals. We evaluated the performance of four LMs—BERT, BioBERT, MedAlpaca, and GPT-3.5—on our newly developed dataset, BPQA (Blood Pressure Question Answering). BPQA contains 100 medical QA pairs that were verified by medical students and designed to rely on BPs. We found that GPT-3.5 and MedAlpaca (large and medium-sized LMs) benefit more from the inclusion of BPs than BERT and BioBERT (small LMs). Further, augmenting measurements with labels improves the performance of BioBERT and MedAlpaca (domain-specific LMs), suggesting that retrieval may be useful for improving domain-specific LMs.
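The "augmenting measurements with labels" idea can be illustrated with a minimal sketch: a raw BP reading in a question is annotated with a coarse clinical category before the question is given to an LM. The thresholds below follow common ACC/AHA hypertension guideline categories; the function names, the `[BP]` placeholder, and the exact labeling scheme are assumptions for illustration and may differ from what the paper uses.

```python
# Hypothetical sketch of label augmentation for BPQA-style questions:
# annotate a raw blood-pressure reading with a clinical category label.
# Thresholds follow common ACC/AHA guideline categories (illustrative only).

def bp_label(systolic: int, diastolic: int) -> str:
    """Map a BP reading (mmHg) to a coarse clinical category."""
    if systolic < 90 or diastolic < 60:
        return "hypotension"
    if systolic < 120 and diastolic < 80:
        return "normal"
    if systolic < 130 and diastolic < 80:
        return "elevated"
    if systolic < 140 or diastolic < 90:
        return "stage 1 hypertension"
    return "stage 2 hypertension"

def augment_question(question: str, systolic: int, diastolic: int) -> str:
    """Replace a [BP] placeholder with the labeled reading."""
    reading = f"{systolic}/{diastolic} mmHg ({bp_label(systolic, diastolic)})"
    return question.replace("[BP]", reading)

print(augment_question(
    "The patient's blood pressure is [BP]. Is urgent treatment needed?",
    165, 100,
))
# → The patient's blood pressure is 165/100 mmHg (stage 2 hypertension). Is urgent treatment needed?
```

The intuition is that smaller domain-specific LMs may not reliably map raw numbers to clinical significance, so surfacing the category as text lets them rely on their stronger handling of medical terminology.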