Evaluation of Causal Reasoning for Large Language Models in Contextualized Clinical Scenarios of Laboratory Test Interpretation

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack systematic, expert-validated benchmarks for causal reasoning in evidence-based laboratory medicine. Method: This study introduces the first causal reasoning evaluation framework for clinical laboratory tests, grounded in Pearl's causal ladder, spanning association, intervention, and counterfactual reasoning, and comprising 99 items focused on biomarkers (e.g., HbA1c, creatinine, vitamin D) and covariates (e.g., age, sex, obesity, smoking). We evaluated GPT-o1 and Llama-3.2-8B-Instruct using double-blind scoring by four medical experts and quantified performance via AUROC, sensitivity, and specificity. Contribution/Results: GPT-o1 significantly outperformed Llama-3.2-8B-Instruct (AUROC = 0.80 vs. 0.73; sensitivity = 0.90 vs. 0.84; specificity = 0.93 vs. 0.80), yet both models exhibited marked limitations in counterfactual reasoning, particularly on outcome-alteration queries. This work establishes the first expert-validated, hierarchical benchmark for assessing LLMs' causal inference capabilities in clinical laboratory medicine.
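The three rungs of Pearl's causal ladder that the benchmark spans can be made concrete with a small sketch. The query forms are standard; the example questions below are hypothetical illustrations, not items from the paper's 99-item benchmark:

```python
# Pearl's Ladder of Causation, as applied to laboratory-test reasoning.
# The example questions are invented for illustration only.
CAUSAL_LADDER = {
    "association": {        # Rung 1: seeing -- observational queries
        "query": "P(y | x)",
        "example": "Is elevated HbA1c associated with older age?",
    },
    "intervention": {       # Rung 2: doing -- effects of actions
        "query": "P(y | do(x))",
        "example": "If a patient starts treatment, will HbA1c decrease?",
    },
    "counterfactual": {     # Rung 3: imagining -- retrospective what-ifs
        "query": "P(y_x | x', y')",
        "example": "Had this patient not smoked, would creatinine be normal?",
    },
}

for rung, info in CAUSAL_LADDER.items():
    print(f"{rung}: {info['query']} -- {info['example']}")
```

Each rung strictly subsumes the one below it, which is why the paper reports per-rung scores rather than a single aggregate.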

📝 Abstract
This study evaluates causal reasoning in large language models (LLMs) using 99 clinically grounded laboratory test scenarios aligned with Pearl's Ladder of Causation: association, intervention, and counterfactual reasoning. We examined common laboratory tests such as hemoglobin A1c, creatinine, and vitamin D, and paired them with relevant causal factors including age, gender, obesity, and smoking. Two LLMs, GPT-o1 and Llama-3.2-8B-Instruct, were tested, with responses evaluated by four medically trained human experts. GPT-o1 demonstrated stronger discriminative performance (overall AUROC = 0.80 ± 0.12) than Llama-3.2-8B-Instruct (0.73 ± 0.15), with higher scores for association (0.75 vs. 0.72), intervention (0.84 vs. 0.70), and counterfactual reasoning (0.84 vs. 0.69). Sensitivity (0.90 vs. 0.84) and specificity (0.93 vs. 0.80) were also greater for GPT-o1, and expert ratings of reasoning quality showed similar trends. Both models performed best on intervention questions and worst on counterfactuals, particularly in altered-outcome scenarios. These findings suggest GPT-o1 provides more consistent causal reasoning, but refinement is required before adoption in high-stakes clinical applications.
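The reported metrics can in principle be reproduced from expert correctness labels and model confidence scores. A minimal stdlib sketch, using made-up illustrative data rather than the study's actual ratings:

```python
# Sketch of the evaluation metrics: sensitivity, specificity, and a
# rank-based AUROC. The labels and scores below are invented examples,
# not data from the paper.

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, scores):
    """AUROC as the probability a positive outranks a negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative expert labels (1 = causally correct) and model confidences.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.85, 0.2, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"AUROC={auroc(y_true, scores):.2f}, sens={sens:.2f}, spec={spec:.2f}")
```

The rank-based formulation of AUROC is threshold-free, which is why it can diverge from sensitivity and specificity computed at a fixed cutoff.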
Problem

Research questions and friction points this paper is trying to address.

Evaluating causal reasoning in LLMs using clinical lab test scenarios
Assessing LLM performance on association, intervention, and counterfactual reasoning
Comparing GPT-o1 and Llama-3.2 models for clinical decision support
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated causal reasoning using Pearl's Ladder of Causation
Tested LLMs on clinical lab scenarios with expert evaluation
Assessed performance on association, intervention, and counterfactual reasoning
👥 Authors
Balu Bhasuran, School of Information, Florida State University, Tallahassee, FL, USA
Mattia Prosperi, University of Florida (biomedical informatics, artificial intelligence, data science, epidemiology, bioinformatics)
Karim Hanna, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
John Petrilli, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
Caretia JeLayne Washington, Department of Epidemiology, University of Florida, Gainesville, FL, USA
Zhe He, University of Macau (deep learning, reinforcement learning, POMDPs)