🤖 AI Summary
This study addresses the problem that the benchmark graphs used to evaluate causal discovery methods often diverge from up-to-date domain knowledge, which compromises the reliability of method evaluations, particularly for large language model (LLM) based approaches that rely heavily on scientific literature. The authors propose what they describe as the first automated and scalable validation framework, which combines scientific literature retrieval, LLM prompting, and causal graph alignment to systematically assess whether benchmark causal graphs are consistent with domain research. Applied to 11 widely used real-world benchmarks and 38,081 relevant research papers, the analysis finds that the benchmarks vary significantly in their consistency with current domain research, with implications for evaluating and refining causal discovery methods as scientific knowledge evolves.
📝 Abstract
In graphical causal modeling, causal discovery aims to construct a causal graph from numerical data and domain knowledge expressed in plain text. However, evaluating causal discovery methods remains a challenge, because progress in domain research often leaves benchmark causal graphs containing misaligned knowledge. This problem especially affects the evaluation of large language model (LLM) based causal discovery methods, as they are sensitive to new discoveries in the literature. This work is the first to systematically study the quality of benchmark causal graphs. Specifically, we design a pipeline that automatically retrieves relevant research papers from scientific databases and prompts LLMs to check the consistency between the benchmark causal graphs and the domain research papers. We evaluate 11 popular real-world benchmarks, for which our pipeline processes 38,081 domain papers in total. Our results show that popular benchmarks vary significantly in their consistency with domain research, with clear implications for causal discovery research.
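The abstract does not specify how per-paper LLM judgments are combined into a benchmark-level result. As a minimal sketch of one way such an aggregation step could look, the Python below assumes each retrieved paper yields a verdict label for a benchmark edge. The label names, the functions `edge_consistency` and `graph_consistency`, the majority rule, and the toy data are all hypothetical illustrations, not the authors' method or results.

```python
from collections import Counter, defaultdict

# Hypothetical verdict labels an LLM might return for one (paper, edge) pair.
SUPPORT, CONTRADICT, UNRELATED = "support", "contradict", "unrelated"


def edge_consistency(benchmark_edges, verdicts):
    """Aggregate per-paper verdicts into a per-edge label count.

    benchmark_edges: iterable of (cause, effect) tuples from the benchmark graph.
    verdicts: iterable of ((cause, effect), label) pairs, one per paper checked.
    Returns a dict mapping each benchmark edge to a Counter of labels.
    """
    counts = defaultdict(Counter)
    for edge, label in verdicts:
        counts[edge][label] += 1
    return {edge: counts[edge] for edge in benchmark_edges}


def graph_consistency(benchmark_edges, verdicts):
    """Fraction of evidence-bearing edges where support outweighs contradiction.

    Edges with no supporting or contradicting paper are excluded from the
    ratio (an assumed design choice); returns None if no edge has evidence.
    """
    summary = edge_consistency(benchmark_edges, verdicts)
    judged = [c for c in summary.values() if c[SUPPORT] + c[CONTRADICT] > 0]
    if not judged:
        return None
    agreed = sum(1 for c in judged if c[SUPPORT] > c[CONTRADICT])
    return agreed / len(judged)


# Toy data for illustration only; not taken from any benchmark or paper.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
verdicts = [
    (("A", "B"), SUPPORT),
    (("A", "B"), SUPPORT),
    (("B", "C"), CONTRADICT),
    (("B", "C"), UNRELATED),
]
score = graph_consistency(edges, verdicts)
```

Excluding edges with no evidence keeps a sparsely studied domain from being scored as inconsistent merely because few papers mention its variables; a real pipeline might instead report such edges separately.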