Evaluation of Causal Reasoning for Large Language Models in Contextualized Clinical Scenarios of Laboratory Test Interpretation

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack systematic, expert-validated benchmarks for causal reasoning in evidence-based laboratory medicine. Method: This study introduces the first causal reasoning evaluation framework for clinical laboratory tests, grounded in Pearl's causal ladder, spanning association, intervention, and counterfactual reasoning, and comprising 99 items focused on biomarkers (e.g., HbA1c, creatinine, vitamin D) and covariates (e.g., age, sex, obesity, smoking). We evaluated GPT-o1 and Llama-3.2-8B-Instruct using double-blind scoring by four medical experts and quantified performance via AUROC, sensitivity, and specificity. Contribution/Results: GPT-o1 significantly outperformed Llama-3.2-8B-Instruct (AUROC = 0.80 vs. 0.73; sensitivity = 0.90 vs. 0.84; specificity = 0.93 vs. 0.80), yet both models exhibited marked limitations in counterfactual reasoning, particularly on outcome-alteration queries. This work establishes the first expert-validated, hierarchical benchmark for assessing LLMs' causal inference capabilities in clinical laboratory medicine.
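The three rungs of Pearl's causal ladder that the benchmark spans can be made concrete with a small sketch. The query forms are standard; the example questions below are hypothetical illustrations, not items from the paper's 99-item benchmark:

```python
# Pearl's Ladder of Causation, as applied to laboratory-test reasoning.
# The example questions are invented for illustration only.
CAUSAL_LADDER = {
    "association": {        # Rung 1: seeing -- observational queries
        "query": "P(y | x)",
        "example": "Is elevated HbA1c associated with older age?",
    },
    "intervention": {       # Rung 2: doing -- effects of actions
        "query": "P(y | do(x))",
        "example": "If a patient starts treatment, will HbA1c decrease?",
    },
    "counterfactual": {     # Rung 3: imagining -- retrospective what-ifs
        "query": "P(y_x | x', y')",
        "example": "Had this patient not smoked, would creatinine be normal?",
    },
}

for rung, info in CAUSAL_LADDER.items():
    print(f"{rung}: {info['query']} -- {info['example']}")
```

Each rung strictly subsumes the one below it, which is why the paper reports per-rung scores rather than a single aggregate.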

📝 Abstract
This study evaluates causal reasoning in large language models (LLMs) using 99 clinically grounded laboratory test scenarios aligned with Pearl's Ladder of Causation: association, intervention, and counterfactual reasoning. We examined common laboratory tests such as hemoglobin A1c, creatinine, and vitamin D, and paired them with relevant causal factors including age, gender, obesity, and smoking. Two LLMs, GPT-o1 and Llama-3.2-8B-Instruct, were tested, with responses evaluated by four medically trained human experts. GPT-o1 demonstrated stronger discriminative performance (overall AUROC = 0.80 ± 0.12) than Llama-3.2-8B-Instruct (0.73 ± 0.15), with higher scores for association (0.75 vs. 0.72), intervention (0.84 vs. 0.70), and counterfactual reasoning (0.84 vs. 0.69). Sensitivity (0.90 vs. 0.84) and specificity (0.93 vs. 0.80) were also greater for GPT-o1, and expert ratings of reasoning quality showed similar trends. Both models performed best on intervention questions and worst on counterfactuals, particularly in altered-outcome scenarios. These findings suggest GPT-o1 provides more consistent causal reasoning, but refinement is required before adoption in high-stakes clinical applications.
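The reported metrics can in principle be reproduced from expert correctness labels and model confidence scores. A minimal stdlib sketch, using made-up illustrative data rather than the study's actual ratings:

```python
# Sketch of the evaluation metrics: sensitivity, specificity, and a
# rank-based AUROC. The labels and scores below are invented examples,
# not data from the paper.

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, scores):
    """AUROC as the probability a positive outranks a negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative expert labels (1 = causally correct) and model confidences.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.85, 0.2, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"AUROC={auroc(y_true, scores):.2f}, sens={sens:.2f}, spec={spec:.2f}")
```

The rank-based formulation of AUROC is threshold-free, which is why it can diverge from sensitivity and specificity computed at a fixed cutoff.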
Problem

Research questions and friction points this paper is trying to address.

Evaluating causal reasoning in LLMs using clinical lab test scenarios
Assessing LLM performance on association, intervention, and counterfactual reasoning
Comparing GPT-o1 and Llama-3.2 models for clinical decision support
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated causal reasoning using Pearl's Ladder of Causation
Tested LLMs on clinical lab scenarios with expert evaluation
Assessed performance on association, intervention, and counterfactual reasoning
👥 Authors
Balu Bhasuran, School of Information, Florida State University, Tallahassee, FL, USA
Mattia Prosperi, University of Florida (biomedical informatics, artificial intelligence, data science, epidemiology, bioinformatics)
Karim Hanna, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
John Petrilli, Morsani College of Medicine, University of South Florida, Tampa, FL, USA
Caretia JeLayne Washington, Department of Epidemiology, University of Florida, Gainesville, FL, USA
Zhe He, University of Macau (deep learning, reinforcement learning, POMDPs)