AI Summary
This paper addresses the limitations of existing counterfactual explanations for image models, namely susceptibility to pixel-level adversarial perturbations, lack of semantic plausibility, and absence of global interpretability. To this end, we propose a latent-space adversarial counterfactual generation framework. Methodologically: (1) adversarial optimization is performed in the low-dimensional latent space of a pre-trained generative model, avoiding direct pixel-space manipulation; (2) we unify counterfactual image generation with feature attribution for the first time, enabling cross-sample quantification of global feature importance via auxiliary descriptive datasets; and (3) the framework supports plug-and-play integration with state-of-the-art generative models (e.g., diffusion models, GANs). Experiments on MNIST and CelebA demonstrate that our method produces semantically coherent, controllable counterfactual images while yielding interpretable, globally consistent feature-importance scores. The approach is lightweight, robust to input perturbations, and generalizes well across architectures and datasets.
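The latent-space attack in step (1) can be sketched with toy linear stand-ins: optimize the latent code, not the pixels, until the classifier's decision flips. The linear generator, linear classifier, learning rate, and stopping rule below are illustrative assumptions, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "generator": maps a 2-D latent code to a 10-D "image".
W_gen = rng.normal(size=(10, 2))

def generator(z):
    return W_gen @ z

# Toy linear classifier score: positive -> class 1, negative -> class 0.
w_clf = rng.normal(size=10)

def classifier_score(x):
    return float(w_clf @ x)

def counterfactual_attack(z0, target_sign, lr=0.1, steps=500):
    """Gradient steps on the latent code until the classifier's
    decision flips to the target side (an adversarial attack run
    in latent space rather than pixel space)."""
    z = z0.copy()
    # For this linear toy, d/dz classifier_score(generator(z)) = W_gen.T @ w_clf.
    grad_z = W_gen.T @ w_clf
    for _ in range(steps):
        if np.sign(classifier_score(generator(z))) == target_sign:
            break
        z = z + lr * target_sign * grad_z
    return z

z0 = rng.normal(size=2)
orig_score = classifier_score(generator(z0))
target = -np.sign(orig_score)            # ask for the opposite decision
z_cf = counterfactual_attack(z0, target) # counterfactual latent code
cf_score = classifier_score(generator(z_cf))
```

Because the search moves along the generator's low-dimensional manifold, every intermediate image is a decoded latent code, which is what keeps the counterfactual semantically plausible rather than an imperceptible pixel perturbation.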
Abstract
Counterfactuals are a popular framework for interpreting machine learning predictions. These "what if" explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to producing adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it is particularly computationally efficient for ours. We demonstrate the efficacy of our approach with the MNIST and CelebA datasets.
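The aggregation step described above, from per-sample descriptor changes to a global importance ranking, can be sketched as follows. The descriptor names, the toy values, and the mean-absolute-change aggregation are illustrative assumptions, not the paper's exact scoring.

```python
import numpy as np

# Hypothetical image descriptors from an auxiliary dataset (CelebA-style).
descriptors = ["smiling", "glasses", "hair_color"]

# Toy descriptor values for four (original, counterfactual) image pairs.
orig = np.array([[0.9, 0.1, 0.5],
                 [0.8, 0.2, 0.4],
                 [0.7, 0.1, 0.6],
                 [0.9, 0.3, 0.5]])
cf = np.array([[0.1, 0.1, 0.5],
               [0.2, 0.3, 0.4],
               [0.1, 0.1, 0.7],
               [0.2, 0.2, 0.5]])

# Per-sample attribution: how much each descriptor changed between the
# original image and its counterfactual.
per_sample = cf - orig

# Global importance: aggregate the per-sample changes across the dataset
# (here, mean absolute change per descriptor).
global_importance = np.abs(per_sample).mean(axis=0)
ranking = [descriptors[i] for i in np.argsort(-global_importance)]
```

In this toy data the "smiling" descriptor changes most between originals and counterfactuals, so it tops the global ranking, mirroring how aggregated importance scores surface the overall features driving the model's predictions.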