Towards Causal Model-Based Policy Optimization

πŸ“… 2025-03-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Traditional model-based reinforcement learning (MBRL) suffers from poor robustness to distributional shifts and limited generalization due to its failure to capture the underlying causal mechanisms of environment dynamics, leading it to learn spurious correlations. To address this, we propose Causal Model-based Policy Optimization (C-MBPO), the first MBRL framework integrating Structural Causal Models (SCMs) to formulate an intervenable and interpretable Causal Markov Decision Process (C-MDP). C-MBPO jointly performs online trajectory learning, local SCM identification, causal Bayesian network inference, counterfactual state–reward simulation, and intervention-aware policy gradient optimization. Experiments demonstrate that C-MBPO significantly improves policy robustness and out-of-distribution generalization under both near- and far-domain distribution shifts. Crucially, it accurately detects and suppresses spurious correlations, yielding stable, interpretable, and causally grounded decision-making.

Technology Category

Application Category

πŸ“ Abstract
Real-world decision-making problems are often marked by complex, uncertain dynamics that can shift or break under changing conditions. Traditional Model-Based Reinforcement Learning (MBRL) approaches learn predictive models of environment dynamics from queried trajectories and then use these models to simulate rollouts for policy optimization. However, such methods do not account for the underlying causal mechanisms that govern the environment, and thus inadvertently capture spurious correlations, making them sensitive to distributional shifts and limiting their ability to generalize. The same naturally holds for model-free approaches. In this work, we introduce Causal Model-Based Policy Optimization (C-MBPO), a novel framework that integrates causal learning into the MBRL pipeline to achieve more robust, explainable, and generalizable policy learning algorithms. Our approach centers on first inferring a Causal Markov Decision Process (C-MDP) by learning a local Structural Causal Model (SCM) of both the state and reward transition dynamics from trajectories gathered online. C-MDPs differ from classic MDPs in that we can decompose causal dependencies in the environment dynamics via specifying an associated Causal Bayesian Network. C-MDPs allow for targeted interventions and counterfactual reasoning, enabling the agent to distinguish between mere statistical correlations and causal relationships. The learned SCM is then used to simulate counterfactual on-policy transitions and rewards under hypothetical actions (or ``interventions"), thereby guiding policy optimization more effectively. The resulting policy learned by C-MBPO can be shown to be robust to a class of distributional shifts that affect spurious, non-causal relationships in the dynamics. We demonstrate this through some simple experiments involving near and far OOD dynamics drifts.
Problem

Research questions and friction points this paper is trying to address.

Addresses sensitivity to distributional shifts in decision-making.
Integrates causal learning to improve policy generalization.
Uses Causal Markov Decision Processes for robust policy optimization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates causal learning into MBRL pipeline
Uses Causal Markov Decision Process (C-MDP)
Simulates counterfactual transitions for policy optimization
πŸ”Ž Similar Papers
No similar papers found.