🤖 AI Summary
DAG learning faces two fundamental challenges: super-exponential computational complexity and structural unidentifiability under limited samples. To address these, this paper introduces the first foundation model framework for DAG learning. It pretrains a unified mapping from data distributions to both causal graph structures and their parametric forms, incorporating a shared low-dimensional prior to enhance few-shot generalization and zero-shot inference. The paper proposes Attention-DAG (ADAG), a novel architecture that integrates linear Transformers with a nonlinear attention kernel, enabling efficient end-to-end causal structure learning. Evaluated on standard synthetic benchmarks, ADAG achieves significant improvements in structural recovery accuracy, supports rapid zero-shot inference, and demonstrates strong scalability and practical deployability, validating its effectiveness, generalizability, and potential for real-world causal discovery applications.
📝 Abstract
Owing to its human interpretability and invariance properties, the Directed Acyclic Graph (DAG) has been a foundational tool across various areas of AI research, leading to significant advancements. However, DAG learning remains highly challenging, due to the super-exponential growth of computational cost with the number of variables and to identifiability issues, particularly in small-sample regimes. To address these two challenges, in this work we leverage the recent success of linear transformers and develop a foundation model approach for discovering multiple order-consistent DAGs across tasks. In particular, we propose Attention-DAG (ADAG), a novel attention-mechanism-based architecture for learning multiple linear Structural Equation Models (SEMs). ADAG learns the mapping from observed data to both graph structure and parameters via a nonlinear attention-based kernel, enabling efficient multi-task estimation of the underlying linear SEMs. By formulating the learning process across multiple tasks as a continuous optimization problem, the pre-trained ADAG model captures common structural properties as a shared low-dimensional prior, thereby reducing the ill-posedness of downstream DAG learning tasks in small-sample regimes. We evaluate our proposed approach on benchmark synthetic datasets and find that ADAG achieves substantial improvements in both DAG learning accuracy and zero-shot inference efficiency. To the best of our knowledge, this is the first practical approach for pre-training a foundation model specifically designed for DAG learning, representing a step toward more efficient and generalizable downstream applications in causal discovery.
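The linear SEM setting the abstract refers to can be made concrete with a small synthetic example. The sketch below (illustrative only, not the paper's code; dimensions, weight ranges, and variable names are assumptions) generates data from a linear SEM whose weighted adjacency matrix defines a DAG, the kind of task instance a pre-trained model such as ADAG would take as input:

```python
import numpy as np

# In a linear SEM, each variable is a weighted sum of its parents plus noise:
#   X = X @ W + E,
# where W is the weighted adjacency matrix of a DAG (W[i, j] != 0 means
# an edge i -> j). Recovering W's support and values from samples of X is
# the DAG learning problem.

rng = np.random.default_rng(0)
d, n = 5, 1000  # number of variables, number of samples (illustrative choices)

# A strictly upper-triangular W is acyclic by construction
# (variables are already in a topological order).
mask = rng.choice([0.0, 1.0], size=(d, d), p=[0.6, 0.4])
W = np.triu(rng.uniform(0.5, 1.5, size=(d, d)) * mask, k=1)

# Ancestral sampling: visit variables in topological order so that
# every parent is generated before its children.
E = rng.normal(size=(n, d))  # exogenous noise
X = np.zeros((n, d))
for j in range(d):
    X[:, j] = X @ W[:, j] + E[:, j]

# Sanity check against the closed form: X (I - W) = E  =>  X = E (I - W)^{-1}.
X_closed = E @ np.linalg.inv(np.eye(d) - W)
assert np.allclose(X, X_closed)
```

A foundation model approach in this setting would be pre-trained on many such (data, DAG) pairs drawn from related tasks, so that shared structure across the tasks acts as a prior when only a few samples per task are available.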