🤖 AI Summary
This study addresses the unclear capability of large language models (LLMs) to identify and extract causal relationships in high-stakes domains such as biomedicine. The authors present the first standardized benchmark, comprising 12 diverse datasets, to systematically evaluate 13 open-source LLMs on two tasks: causal detection and causal extraction. They employ prompting strategies including zero-shot, chain-of-thought (CoT), and few-shot in-context learning (FICL). Based on high-quality human-validated annotations (Cohen’s κ ≥ 0.758), the results reveal that even the best-performing models achieve only limited scores (49.57% on causal detection and 47.12% on causal extraction), with pronounced performance degradation on implicit, cross-sentence, and multi-causal cases. This work establishes a unified evaluation framework and provides a multi-domain empirical foundation for assessing LLMs’ causal reasoning capabilities.
📝 Abstract
The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) \textbf{Causal Detection} (identifying whether a text contains a causal link) and 2) \textbf{Causal Extraction} (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, achieved a mean score of only 49.57\% ($C_{detect}$), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12\% ($C_{extract}$). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted on more difficult (and more realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement ($\kappa \ge 0.758$), and make all our data, code, and prompts publicly available to spur further research. \href{https://github.com/sydneyanuyah/CausalDiscovery}{Code available here: https://github.com/sydneyanuyah/CausalDiscovery}
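The reported inter-annotator agreement ($\kappa \ge 0.758$) is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the computation for two annotators' causal/non-causal labels (the labels below are hypothetical, not from the paper's dataset):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement: from each annotator's label marginals.
    count_a, count_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(count_a[lab] * count_b[lab]
              for lab in set(ann_a) | set(ann_b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary labels from two annotators on six text snippets.
a = ["causal", "causal", "noncausal", "causal", "noncausal", "noncausal"]
b = ["causal", "causal", "noncausal", "noncausal", "noncausal", "noncausal"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

Values above roughly 0.6 are conventionally read as substantial agreement, so the paper's κ ≥ 0.758 indicates a reliably annotated gold standard.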