🤖 AI Summary
This study addresses the unclear capability of large language models (LLMs) to identify and extract causal relationships in high-stakes domains such as biomedicine. The authors present the first standardized benchmark, comprising 12 diverse datasets, to systematically evaluate 13 open-source LLMs on two tasks: causal detection and causal extraction. They employ prompting strategies including zero-shot, chain-of-thought (CoT), and few-shot in-context learning (FICL). Based on high-quality human-validated annotations (Cohen’s κ ≥ 0.758), the results reveal that even the best-performing models achieve only limited scores (49.57% on causal detection and 47.12% on causal extraction), with pronounced performance degradation on implicit, cross-sentence, and multi-causal cases. This work establishes a unified evaluation framework and provides a multi-domain empirical foundation for assessing LLMs’ causal reasoning capabilities.
📝 Abstract
The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) \textbf{Causal Detection} (identifying whether a text contains a causal link) and 2) \textbf{Causal Extraction} (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, achieved a mean score of only 49.57\% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12\% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted on more difficult (and more realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. \href{https://github.com/sydneyanuyah/CausalDiscovery}{Code available here: https://github.com/sydneyanuyah/CausalDiscovery}
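The reported inter-annotator agreement ($\kappa \ge 0.758$) is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the computation for two annotators' causal/non-causal labels (the labels below are hypothetical, not from the paper's dataset):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement: from each annotator's label marginals.
    count_a, count_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(count_a[lab] * count_b[lab]
              for lab in set(ann_a) | set(ann_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary labels from two annotators on six text snippets.
a = ["causal", "causal", "noncausal", "causal", "noncausal", "noncausal"]
b = ["causal", "causal", "noncausal", "noncausal", "noncausal", "noncausal"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

Values above roughly 0.6 are conventionally read as substantial agreement, so the paper's κ ≥ 0.758 indicates a reliably annotated gold standard.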