Counterfactual Influence in Markov Decision Processes

📅 2024-02-13
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In Markov decision processes (MDPs), counterfactual reasoning suffers from “influence decay”: as counterfactual trajectories diverge from the observed one over time, the observation loses its constraining power over the counterfactual distribution, collapsing counterfactual estimation into mere interventional estimation. The paper gives the first formal characterization of this issue and shows that the popular Gumbel-max structural causal model (SCM) is susceptible to it, since it has no mechanism for preserving influence. Method: an influence metric based on the divergence between counterfactual and interventional distributions, and a counterfactual policy learning framework that enforces persistent trajectory-level influence constraints, integrating structural causal modeling, Gumbel-max reparameterization, and constrained optimization. Results: across multiple MDP benchmarks, the method learns near-optimal policies even under strong influence constraints, significantly outperforming conventional interventional baselines.
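The Gumbel-max SCM mentioned above supports counterfactual queries by abduction: infer the Gumbel noise consistent with the observed action, then replay that noise under alternative logits. A minimal single-step sketch (the function names and the top-down truncated-Gumbel construction shown here are a common approach, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def sample_gumbel():
    return -np.log(-np.log(rng.uniform()))

def posterior_gumbels(logits, obs_action):
    """Sample Gumbel noise g with argmax(logits + g) == obs_action,
    via the top-down truncated-Gumbel construction (abduction step)."""
    K = len(logits)
    u = np.empty(K)
    # The max of the perturbed logits is distributed Gumbel(logsumexp(logits)).
    u[obs_action] = sample_gumbel() + logsumexp(logits)
    for j in range(K):
        if j == obs_action:
            continue
        t = sample_gumbel() + logits[j]  # unconstrained perturbed logit
        # Truncate below the observed maximum so obs_action stays the argmax.
        u[j] = -np.log(np.exp(-t) + np.exp(-u[obs_action]))
    return u - logits

def counterfactual_action(obs_logits, obs_action, cf_logits):
    """Action that would have been taken under cf_logits, holding fixed
    the noise inferred from the observed (obs_logits, obs_action) pair."""
    g = posterior_gumbels(obs_logits, obs_action)
    return int(np.argmax(cf_logits + g))
```

When `cf_logits` equals `obs_logits`, the counterfactual reproduces the observed action (counterfactual consistency); the decay issue arises when such single-step updates are chained over a long MDP trajectory.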

📝 Abstract
Our work addresses a fundamental problem in the context of counterfactual inference for Markov Decision Processes (MDPs). Given an MDP path $\tau$, this kind of inference allows us to derive counterfactual paths $\tau'$ describing what-if versions of $\tau$ obtained under different action sequences than those observed in $\tau$. However, as the counterfactual states and actions deviate from the observed ones over time, the observation $\tau$ may no longer influence the counterfactual world, meaning that the analysis is no longer tailored to the individual observation, resulting in interventional outcomes rather than counterfactual ones. Even though this issue specifically affects the popular Gumbel-max structural causal model used for MDP counterfactuals, it has remained overlooked until now. In this work, we introduce a formal characterisation of influence based on comparing counterfactual and interventional distributions. We devise an algorithm to construct counterfactual models that automatically satisfy influence constraints. Leveraging such models, we derive counterfactual policies that are not just optimal for a given reward structure but also remain tailored to the observed path. Even though there is an unavoidable trade-off between policy optimality and strength of influence constraints, our experiments demonstrate that it is possible to derive (near-)optimal policies while remaining under the influence of the observation.
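The abstract characterises influence by comparing counterfactual and interventional distributions. A minimal single-step Monte-Carlo sketch of that idea (the function name, the rejection-sampling estimator, and the choice of total-variation distance are illustrative assumptions, not the paper's definition):

```python
import numpy as np

def influence_score(obs_logits, obs_action, cf_logits, n=20000, seed=0):
    """Monte-Carlo estimate of the total-variation distance between the
    counterfactual action distribution (Gumbel noise conditioned on the
    observed action) and the interventional one (noise unconditioned).
    A score near zero means the observation no longer constrains the
    counterfactual world."""
    rng = np.random.default_rng(seed)
    K = len(obs_logits)
    g = -np.log(-np.log(rng.uniform(size=(n, K))))  # Gumbel(0, 1) noise
    # Abduction by rejection: keep draws consistent with the observed action.
    mask = np.argmax(obs_logits + g, axis=1) == obs_action
    assert mask.any(), "no noise draw matched the observed action"
    cf_actions = np.argmax(cf_logits + g[mask], axis=1)
    cf_dist = np.bincount(cf_actions, minlength=K) / mask.sum()
    # Interventional distribution: same mechanism, observation ignored.
    int_actions = np.argmax(cf_logits + g, axis=1)
    int_dist = np.bincount(int_actions, minlength=K) / n
    return 0.5 * np.abs(cf_dist - int_dist).sum()
```

A high score indicates the observation still shapes the counterfactual; the trade-off the abstract describes is between keeping such scores high along the whole trajectory and maximising reward.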
Problem

Research questions and friction points this paper is trying to address.

Keeping counterfactual paths influenced by the observed MDP trajectory
Influence loss in Gumbel-max MDP counterfactual models, overlooked until now
Trade-off between policy optimality and observation-influence constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formal influence characterization via distribution comparison
Algorithm for influence-constrained counterfactual models
Near-optimal policies under observation influence