🤖 AI Summary
Most existing interpretability methods rely on statistical associations and therefore do not capture the causal mechanisms behind a black-box model's predictions; the problem is most acute when input variables depend on one another, where attributions can be misleading. This paper introduces “counterfactual explainability,” a causal notion of attribution motivated by genetic heritability in twin studies. It extends global sensitivity analysis (functional ANOVA and Sobol's indices) to a counterfactual setting and is defined as a probability measure on an “explanation algebra,” so it covers sets of input factors as well as their interactions. For dependent inputs, the causal relationships are modeled by a directed acyclic graph (DAG), which brings causal mechanisms into the explanation. Two paradoxical examples are used to show that associational explanations are generally inadequate.
📝 Abstract
It is crucial to be able to explain black-box prediction models to use them effectively and safely in practice. Most existing tools for model explanations are associational rather than causal, and we use two paradoxical examples to show that such explanations are generally inadequate. Motivated by the concept of genetic heritability in twin studies, we propose a new notion called counterfactual explainability for black-box prediction models. Counterfactual explainability has three key advantages: (1) it leverages counterfactual outcomes and extends methods for global sensitivity analysis (such as functional analysis of variance and Sobol's indices) to a causal setting; (2) it is defined not only for the totality of a set of input factors but also for their interactions (indeed, it is a probability measure on a whole "explanation algebra"); (3) it also applies to dependent input factors whose causal relationship can be modeled by a directed acyclic graph, thus incorporating causal mechanisms into the explanation.
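For readers unfamiliar with the sensitivity-analysis tools the abstract builds on, the following is a minimal sketch of classical Sobol' indices estimated with the pick-freeze (Jansen) scheme. It is background only: it covers the standard, associational setting with independent inputs, not the paper's counterfactual explainability. The toy function `model`, the helper `sobol_indices`, and the choice of two independent standard normal inputs are all assumptions made for illustration.

```python
import random


def model(x1, x2):
    # Hypothetical black box: a main effect of x1 plus an x1*x2 interaction.
    return x1 + x1 * x2


def sobol_indices(f, n=50_000, seed=0):
    """Estimate first-order and total Sobol' indices for a two-input function.

    Assumes both inputs are independent N(0, 1); uses Jansen's estimators.
    """
    rng = random.Random(seed)
    # Two independent input samples, A and B.
    a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    b = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    ya = [f(*x) for x in a]
    yb = [f(*x) for x in b]
    mean = sum(ya) / n
    var = sum((y - mean) ** 2 for y in ya) / n

    first, total = [], []
    for i in range(2):
        # Sample A with coordinate i replaced by coordinate i of sample B.
        mixed = [
            tuple(xb[j] if j == i else xa[j] for j in range(2))
            for xa, xb in zip(a, b)
        ]
        ym = [f(*x) for x in mixed]
        # f(B) and f(mixed) share only x_i, so their squared difference
        # measures everything NOT explained by x_i alone.
        first.append(1 - sum((u - v) ** 2 for u, v in zip(yb, ym)) / (2 * n * var))
        # f(A) and f(mixed) differ only in x_i, so their squared difference
        # measures everything x_i is involved in, including interactions.
        total.append(sum((u - v) ** 2 for u, v in zip(ya, ym)) / (2 * n * var))
    return first, total


first, total = sobol_indices(model)
print("first-order:", [round(s, 2) for s in first])
print("total:      ", [round(t, 2) for t in total])
```

For this toy function the analytic values are first-order indices of 0.5 and 0 and total indices of 1 and 0.5, so the estimates should land close to those. The first-order indices sum to only 0.5: the remaining half of the variance belongs to the interaction between the two inputs, which is the kind of quantity an explanation defined on interactions, and not just on sets of factors, is meant to capture.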