TabCausal: Pretraining Across Causal Environments for Tabular Causal Discovery

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of recovering causal graph structure efficiently and reliably from observational and interventional data. The authors propose TabCausal, a data-driven causal discovery foundation model that is pretrained across diverse causal environments and maps a tabular dataset directly to a causal graph in a single forward pass, avoiding the per-dataset testing, search, or optimization that traditional approaches require. Its central component is a dynamic task construction strategy that composes pretraining tasks from varied graph priors, causal mechanisms, noise models, dimensions, sample sizes, and intervention regimes. The paper also introduces a protocol-guided, LLM-audited semantic benchmark in which domain-grounded structural causal models (SCMs) generate interpretable observational and interventional datasets for out-of-distribution analysis. On large-scale synthetic benchmarks, TabCausal achieves better macro-averaged performance than a diverse set of causal discovery baselines, and across both synthetic and semantic environments it shows robust structure recovery, particularly when interventional data are available.
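To make the idea of composing causal environments into discovery tasks concrete, here is a minimal sketch of how one synthetic task could be sampled. It is not the authors' generator: the function names (`sample_dag`, `simulate`, `sample_task`), the Erdos-Renyi graph prior, the linear/tanh mechanisms, the Gaussian/Laplace noise options, the hard `do(X_j = 2.0)` intervention, and all numeric ranges are assumptions chosen for illustration.

```python
import numpy as np

def sample_dag(d, edge_prob, rng):
    """Sample an Erdos-Renyi DAG; adj[i, j] = 1 means i -> j (assumed prior)."""
    # A strictly upper-triangular matrix is acyclic by construction.
    upper = np.triu(rng.random((d, d)) < edge_prob, k=1).astype(int)
    # Relabel nodes so the causal order is not the column order.
    perm = rng.permutation(d)
    adj = np.zeros((d, d), dtype=int)
    adj[np.ix_(perm, perm)] = upper
    return adj, perm  # perm is a valid topological order of adj

def simulate(adj, weights, order, n, mechanism, noise, rng, target=None, value=2.0):
    """Ancestral sampling; if target is set, apply the hard intervention do(X_target = value)."""
    x = np.zeros((n, adj.shape[0]))
    for j in map(int, order):
        if target is not None and j == target:
            x[:, j] = value  # parents are cut off for the intervened node
            continue
        signal = x @ (adj[:, j] * weights[:, j])  # weighted sum of parents
        if mechanism == "tanh":
            signal = np.tanh(signal)
        eps = rng.normal(size=n) if noise == "gaussian" else rng.laplace(size=n)
        x[:, j] = signal + eps
    return x

def sample_task(rng, d=5, n_obs=200, n_int=50):
    """Draw one (dataset, graph) pair from a randomly composed environment."""
    edge_prob = float(rng.choice([0.2, 0.4]))
    mechanism = str(rng.choice(["linear", "tanh"]))
    noise = str(rng.choice(["gaussian", "laplace"]))
    adj, order = sample_dag(d, edge_prob, rng)
    weights = rng.uniform(0.5, 2.0, size=(d, d)) * rng.choice([-1.0, 1.0], size=(d, d))
    target = int(rng.integers(d))
    x_obs = simulate(adj, weights, order, n_obs, mechanism, noise, rng)
    x_int = simulate(adj, weights, order, n_int, mechanism, noise, rng, target=target)
    # regime = -1 marks observational rows; otherwise the intervened variable index.
    regime = np.concatenate([np.full(n_obs, -1), np.full(n_int, target)])
    return {"data": np.vstack([x_obs, x_int]), "regime": regime, "graph": adj,
            "target": target, "mechanism": mechanism, "noise": noise}

task = sample_task(np.random.default_rng(0))
```

Under these assumptions, each call yields a new mixed observational and interventional table together with its ground-truth adjacency matrix, which is the kind of supervised pair an amortized model would be trained on.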

📝 Abstract

Causal discovery aims to recover directed causal relations from observational and interventional data, providing a basis for mechanistic understanding and reliable decision-making. Causal discovery foundation models (CDFMs) seek to amortize this problem by mapping a dataset directly to a causal graph in a single forward pass, avoiding per-dataset testing, search, or optimization. However, existing CDFMs remain limited, often failing to consistently match strong classical methods, and we find that a key bottleneck is how causal pretraining tasks are constructed. Based on this observation, we propose TabCausal, a data-driven CDFM trained with broad causal pretraining over diverse graph priors, structural mechanisms, noise models, dimensions, sample sizes, and intervention regimes. A dynamic task construction strategy composes these causal environments into varied discovery tasks, enabling more transferable structural learning from observational and mixed-interventional data. On large-scale synthetic benchmarks, TabCausal achieves better macro-averaged performance than a diverse set of causal discovery baselines. To further bridge abstract synthetic generators and realistic causal reasoning scenarios, we introduce a protocol-guided and LLM-audited semantic causal environment benchmark, where domain-grounded SCMs generate interpretable observational and interventional datasets for out-of-distribution analysis. Across both synthetic and semantic environments, TabCausal demonstrates robust structure recovery, especially under interventional evidence, highlighting broad causal pretraining as a key ingredient for transferable amortized causal discovery.
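The abstract reports macro-averaged performance without spelling out the metric here, so the following is only an illustrative sketch. It assumes directed-edge F1 as the per-task score and assumes that "macro-averaged" means each benchmark setting (group) is weighted equally regardless of how many tasks it contains; the names `edge_f1` and `macro_average_f1` are hypothetical.

```python
import numpy as np
from collections import defaultdict

def edge_f1(pred, true):
    """F1 over directed edges of two binary adjacency matrices."""
    pred = np.asarray(pred).astype(bool)
    true = np.asarray(true).astype(bool)
    tp = int(np.sum(pred & true))
    fp = int(np.sum(pred & ~true))
    fn = int(np.sum(~pred & true))
    denom = 2 * tp + fp + fn
    # Two empty graphs agree perfectly, so score 1.0 instead of dividing by zero.
    return 1.0 if denom == 0 else 2 * tp / denom

def macro_average_f1(results):
    """results: iterable of (group, pred_adj, true_adj).

    Averages F1 within each group first, then across groups, so a setting
    with many tasks does not dominate the final score.
    """
    per_group = defaultdict(list)
    for group, pred, true in results:
        per_group[group].append(edge_f1(pred, true))
    return float(np.mean([np.mean(scores) for scores in per_group.values()]))

true_graph = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]   # 0 -> 1 -> 2
reversed_edge = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]  # 0 -> 1, 2 -> 1
score = macro_average_f1([
    ("linear", true_graph, true_graph),
    ("linear", reversed_edge, true_graph),
    ("tanh", reversed_edge, true_graph),
])
```

In this toy example the per-task average would differ from the macro average because the "linear" group contributes two tasks and the "tanh" group only one.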

Problem

Research questions and friction points this paper is trying to address.

causal discovery

foundation models

pretraining

causal graphs

interventional data

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal discovery foundation model

broad causal pretraining

dynamic task construction