AI Summary
This paper addresses the limitations of existing counterfactual explanations for image models, namely susceptibility to pixel-level adversarial perturbations, lack of semantic plausibility, and absence of global interpretability. To this end, we propose a latent-space adversarial counterfactual generation framework. Methodologically: (1) adversarial optimization is performed in the low-dimensional latent space of a pre-trained generative model, avoiding direct pixel-space manipulation; (2) we unify counterfactual image generation with feature attribution for the first time, enabling cross-sample quantification of global feature importance via auxiliary descriptive datasets; and (3) the framework supports plug-and-play integration with state-of-the-art generative models (e.g., diffusion models, GANs). Experiments on MNIST and CelebA demonstrate that our method produces semantically coherent, controllable counterfactual images while yielding interpretable, globally consistent feature-importance scores. The approach is lightweight, robust to input perturbations, and generalizes well across architectures and datasets.
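The latent-space attack in step (1) can be sketched with toy linear stand-ins: optimize the latent code, not the pixels, until the classifier's decision flips. The linear generator, linear classifier, learning rate, and stopping rule below are illustrative assumptions, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "generator": maps a 2-D latent code to a 10-D "image".
W_gen = rng.normal(size=(10, 2))

def generator(z):
    return W_gen @ z

# Toy linear classifier score: positive -> class 1, negative -> class 0.
w_clf = rng.normal(size=10)

def classifier_score(x):
    return float(w_clf @ x)

def counterfactual_attack(z0, target_sign, lr=0.1, steps=500):
    """Gradient steps on the latent code until the classifier's
    decision flips to the target side (an adversarial attack run
    in latent space rather than pixel space)."""
    z = z0.copy()
    # For this linear toy, d/dz classifier_score(generator(z)) = W_gen.T @ w_clf.
    grad_z = W_gen.T @ w_clf
    for _ in range(steps):
        if np.sign(classifier_score(generator(z))) == target_sign:
            break
        z = z + lr * target_sign * grad_z
    return z

z0 = rng.normal(size=2)
orig_score = classifier_score(generator(z0))
target = -np.sign(orig_score)            # ask for the opposite decision
z_cf = counterfactual_attack(z0, target) # counterfactual latent code
cf_score = classifier_score(generator(z_cf))
```

Because the search moves along the generator's low-dimensional manifold, every intermediate image is a decoded latent code, which is what keeps the counterfactual semantically plausible rather than an imperceptible pixel perturbation.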
Abstract
Counterfactuals are a popular framework for interpreting machine learning predictions. These "what if" explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to producing adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it is particularly computationally efficient for ours. We demonstrate the efficacy of our approach with the MNIST and CelebA datasets.
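The aggregation step described above, from per-sample descriptor changes to a global importance ranking, can be sketched as follows. The descriptor names, the toy values, and the mean-absolute-change aggregation are illustrative assumptions, not the paper's exact scoring.

```python
import numpy as np

# Hypothetical image descriptors from an auxiliary dataset (CelebA-style).
descriptors = ["smiling", "glasses", "hair_color"]

# Toy descriptor values for four (original, counterfactual) image pairs.
orig = np.array([[0.9, 0.1, 0.5],
                 [0.8, 0.2, 0.4],
                 [0.7, 0.1, 0.6],
                 [0.9, 0.3, 0.5]])
cf = np.array([[0.1, 0.1, 0.5],
               [0.2, 0.3, 0.4],
               [0.1, 0.1, 0.7],
               [0.2, 0.2, 0.5]])

# Per-sample attribution: how much each descriptor changed between the
# original image and its counterfactual.
per_sample = cf - orig

# Global importance: aggregate the per-sample changes across the dataset
# (here, mean absolute change per descriptor).
global_importance = np.abs(per_sample).mean(axis=0)
ranking = [descriptors[i] for i in np.argsort(-global_importance)]
```

In this toy data the "smiling" descriptor changes most between originals and counterfactuals, so it tops the global ranking, mirroring how aggregated importance scores surface the overall features driving the model's predictions.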