🤖 AI Summary
Current large language models (LLMs) lack systematic, expert-validated benchmarks for causal reasoning in evidence-based laboratory medicine. Method: This study introduces the first causal reasoning evaluation framework for clinical laboratory tests, grounded in Pearl's causal ladder of association, intervention, and counterfactual reasoning. The benchmark comprises 99 items pairing common biomarkers (e.g., HbA1c, creatinine, vitamin D) with relevant covariates (e.g., age, sex, obesity, smoking). We evaluated GPT-o1 and Llama-3.2-8B-Instruct using double-blind scoring by four medical experts and quantified performance via AUROC, sensitivity, and specificity. Contribution/Results: GPT-o1 outperformed Llama-3.2-8B-Instruct (AUROC 0.80 vs 0.73; sensitivity 0.90 vs 0.84; specificity 0.93 vs 0.80), yet both models exhibited marked limitations in counterfactual reasoning, particularly on altered-outcome queries. This work establishes the first expert-validated, hierarchical benchmark for assessing LLMs' causal inference capabilities in clinical laboratory medicine.
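To make the benchmark design concrete, the sketch below shows one plausible way an item pairing a biomarker with a covariate at a given rung of Pearl's ladder could be represented. The field names, schema, and example question are illustrative assumptions, not the authors' published data format.

```python
# Hypothetical item schema for the three rungs of Pearl's causal ladder.
# All names here are assumptions for illustration; the paper does not
# specify how its 99 benchmark items are encoded.
from dataclasses import dataclass
from typing import Literal

Rung = Literal["association", "intervention", "counterfactual"]

@dataclass
class CausalItem:
    biomarker: str        # e.g., "HbA1c"
    covariate: str        # e.g., "obesity"
    rung: Rung            # level on Pearl's Ladder of Causation
    question: str         # prompt posed to the LLM
    expert_answer: bool   # consensus ground-truth label from medical reviewers

# Example counterfactual (rung 3) item, worded illustratively:
item = CausalItem(
    biomarker="HbA1c",
    covariate="obesity",
    rung="counterfactual",
    question=("If a patient with obesity and elevated HbA1c had not been "
              "obese, would the HbA1c result have been lower?"),
    expert_answer=True,
)
```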
📝 Abstract
This study evaluates causal reasoning in large language models (LLMs) using 99 clinically grounded laboratory test scenarios aligned with Pearl's Ladder of Causation: association, intervention, and counterfactual reasoning. We examined common laboratory tests such as hemoglobin A1c, creatinine, and vitamin D, and paired them with relevant causal factors including age, gender, obesity, and smoking. Two LLMs, GPT-o1 and Llama-3.2-8B-Instruct, were tested, with responses evaluated by four medically trained human experts. GPT-o1 demonstrated stronger discriminative performance (overall AUROC = 0.80 ± 0.12) than Llama-3.2-8B-Instruct (0.73 ± 0.15), with higher scores across association (0.75 vs 0.72), intervention (0.84 vs 0.70), and counterfactual reasoning (0.84 vs 0.69). Sensitivity (0.90 vs 0.84) and specificity (0.93 vs 0.80) were also greater for GPT-o1, and expert reasoning ratings showed similar trends. Both models performed best on intervention questions and worst on counterfactuals, particularly in altered-outcome scenarios. These findings suggest that GPT-o1 provides more consistent causal reasoning, but further refinement is required before adoption in high-stakes clinical applications.
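For readers unfamiliar with the reported metrics, the minimal Python sketch below shows one way AUROC, sensitivity, and specificity could be derived from binary expert labels and graded model scores. The toy data, the 0.5 threshold, and the scoring setup are assumptions for illustration only and do not reproduce the study's actual evaluation pipeline.

```python
# Illustrative computation of the reported metrics; data are fabricated
# toy values, not results from the study.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# expert_label: 1 = the causal claim holds per expert consensus, 0 = it does not
# model_score:  the LLM's graded answer, rescaled to [0, 1] (an assumption)
expert_label = np.array([1, 0, 1, 1, 0, 0, 1, 0])
model_score = np.array([0.9, 0.2, 0.8, 0.7, 0.4, 0.1, 0.6, 0.3])

# AUROC: threshold-free ranking quality of the model's scores
auroc = roc_auc_score(expert_label, model_score)

# Binarize at an assumed 0.5 threshold to get sensitivity and specificity
pred = (model_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(expert_label, pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f"AUROC={auroc:.2f}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}")
```

In the study's framing, these metrics would be computed per rung of the causal ladder (association, intervention, counterfactual), which is how the per-category AUROC values above are reported.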