🤖 AI Summary
Existing causal discovery methods struggle with large-scale graphs (≥70 nodes), mixed-type data (continuous and discrete variables), and computational efficiency. Method: This paper proposes a dual-encoder DAG learning framework that jointly leverages large language model (LLM)-derived structural priors and observational data. Specifically, an LLM-generated adjacency matrix serves as a semantic prior and is integrated with gradient-based relaxed reinforcement learning to jointly optimize data-driven signals and domain knowledge under the acyclicity constraint. Contribution/Results: Experiments show the method reduces runtime by 42% on average compared to RL-BIC and KCRL, improves structural recovery accuracy by ~117% over NOTEARS and GraN-DAG, and significantly outperforms existing baselines on large-scale and mixed-data benchmarks, demonstrating superior efficiency, generalizability, and scalability.
📝 Abstract
Modern causal discovery methods face critical limitations in scalability, computational efficiency, and adaptability to mixed data types, as evidenced by benchmarks on node scalability (30, $\le 50$, and $\ge 70$ nodes), computational energy demands, and continuous/non-continuous data handling. While traditional algorithms like PC, GES, and ICA-LiNGAM struggle with these challenges, exhibiting prohibitive energy costs for higher-order nodes and poor scalability beyond 70 nodes, we propose **GUIDE**, a framework that integrates Large Language Model (LLM)-generated adjacency matrices with observational data through a dual-encoder architecture. GUIDE uniquely optimizes computational efficiency, reducing runtime on average by $\approx 42\%$ compared to RL-BIC and KCRL methods, while achieving an average $\approx 117\%$ improvement in accuracy over both NOTEARS and GraN-DAG individually. During training, GUIDE's reinforcement learning agent dynamically balances reward maximization (accuracy) and penalty avoidance (DAG constraints), enabling robust performance across mixed data types and scalability to $\ge 70$ nodes -- a setting where baseline methods fail.
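The abstract does not specify how the DAG constraint is scored, but since NOTEARS is among the baselines, a natural reference point is the NOTEARS trace-exponential acyclicity penalty, which is zero exactly when the weighted adjacency matrix encodes a DAG. The sketch below (an illustrative assumption, not GUIDE's published code; the function name `acyclicity_penalty` is hypothetical) shows how such a penalty distinguishes acyclic from cyclic candidate graphs:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential


def acyclicity_penalty(A: np.ndarray) -> float:
    """NOTEARS-style acyclicity measure: h(A) = tr(exp(A ∘ A)) − d.

    h(A) == 0 if and only if the weighted adjacency matrix A
    (shape d×d) encodes a directed acyclic graph; larger values
    indicate stronger cycles, so h(A) can serve as the "penalty
    avoidance" term an RL agent balances against its reward.
    """
    d = A.shape[0]
    return float(np.trace(expm(A * A)) - d)


# A DAG with a single edge 0 -> 1: penalty is zero.
dag = np.array([[0.0, 1.0],
                [0.0, 0.0]])

# A 2-cycle 0 -> 1 -> 0: penalty is strictly positive.
cyc = np.array([[0.0, 1.0],
                [1.0, 0.0]])
```

For the 2-cycle, $A \circ A$ is the symmetric permutation matrix, so the penalty evaluates to $2\cosh(1) - 2 \approx 1.09$, while the nilpotent DAG matrix yields exactly zero; a learner minimizing this term is pushed toward acyclic structures.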