Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

📅 2026-05-27

🤖 AI Summary
This work addresses the vulnerability of vision classifiers to distribution shift caused by reliance on spurious correlations. Existing bias identification methods often require auxiliary bias labels or model retraining, which makes them hard to apply to deployed, frozen models. The authors propose a post-hoc approach that needs no bias or group labels, only standard class labels from a held-out audit dataset: it extracts interpretable concept vectors by applying non-negative matrix factorization to intermediate activations, then ranks them for bias using gradient-based probes on misclassified samples. Suppressing the top-ranked concepts at inference time mitigates decision-relevant spurious concepts in a frozen model without retraining, including concepts that do not align with human-annotated attributes. In experiments, the method improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA, recovers the known spurious cues in Colored MNIST and Waterbirds, and surfaces directions in CelebA that only partially coincide with the annotated gender attribute.
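Worst-group accuracy, the metric quoted in the summary above, is usually computed as the minimum accuracy over groups formed by pairing each class with each value of the spurious attribute. The sketch below illustrates that computation and the difference, in percentage points, between two sets of predictions. The function name, the group names, and the toy predictions are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def worst_group_accuracy(labels, preds, groups):
    """Return (minimum per-group accuracy, dict of per-group accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group

# Hypothetical toy audit set: class 0 = landbird, class 1 = waterbird.
# Each group name is "<class>/<background>".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
groups = [
    "landbird/land", "landbird/land", "landbird/water", "landbird/water",
    "waterbird/water", "waterbird/water", "waterbird/land", "waterbird/land",
]
preds_before = [0, 0, 1, 1, 1, 1, 0, 1]  # made-up predictions, background-driven errors
preds_after = [0, 0, 0, 1, 1, 1, 1, 1]   # made-up predictions after some intervention

worst_before, per_before = worst_group_accuracy(labels, preds_before, groups)
worst_after, per_after = worst_group_accuracy(labels, preds_after, groups)

# Absolute change, expressed in percentage points rather than relative percent.
gain_pp = (worst_after - worst_before) * 100
print(worst_before, worst_after, gain_pp)
```

Reporting the change in percentage points keeps it well defined even when the starting worst-group accuracy is close to zero, where a relative percentage would be misleading.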
📝 Abstract
Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias mitigation and analysis often depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or the relevant bias is unknown. We present a bias-label-free, post-hoc method for identifying spurious concepts in frozen vision models, relying only on standard class labels from a held-out audit dataset. For each target class, we collect patches from inputs predicted as that class and apply non-negative matrix factorization to intermediate activations to obtain a bank of interpretable concept vectors. Candidate concepts are then ranked with a bias estimator derived from their interaction with backpropagated gradients on misclassified examples: bias concepts tend to get activated when correcting false negatives and suppressed when correcting false positives. On Colored MNIST and Waterbirds the method recovers concepts aligned with the known spurious cue, and on CelebA it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute; suppressing the top-ranked concepts at inference time improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA without any retraining or parameter updates. Our method identifies decision-relevant spurious directions that need not coincide with annotated ones, providing both an interpretable auditing tool and an actionable debiasing handle for frozen vision models. Code is available at https://github.com/vitryt/label-free-bias-identification.
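The pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the multiplicative-update NMF, the scoring rule (mean projection of loss-reducing directions on false negatives minus the same on false positives), the projection-based suppression, and all names and toy numbers are assumptions made for illustration. In a real audit the directions would come from backpropagating the loss to the intermediate activations of misclassified samples.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative A (n x d) as U (n x k) times W (k x d) using
    multiplicative updates. Rows of W play the role of concept vectors."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    W = [[rng.random() + 0.1 for _ in range(d)] for _ in range(k)]
    for _ in range(iters):
        # W <- W * (U^T A) / (U^T U W), elementwise
        Ut = transpose(U)
        num, den = matmul(Ut, A), matmul(matmul(Ut, U), W)
        W = [[w * a / (b + eps) for w, a, b in zip(*rows)] for rows in zip(W, num, den)]
        # U <- U * (A W^T) / (U W W^T), elementwise
        Wt = transpose(W)
        num, den = matmul(A, Wt), matmul(U, matmul(W, Wt))
        U = [[u * a / (b + eps) for u, a, b in zip(*rows)] for rows in zip(U, num, den)]
    return U, W

def bias_scores(W, fn_dirs, fp_dirs):
    """Assumed estimator: a concept scores high when loss-reducing directions
    push it up on false negatives and push it down on false positives."""
    scores = []
    for w in W:
        up = sum(dot(g, w) for g in fn_dirs) / len(fn_dirs)
        down = sum(dot(g, w) for g in fp_dirs) / len(fp_dirs)
        scores.append(up - down)
    return scores

def suppress(a, w):
    """Remove the component of activation a along concept w, then clip at 0."""
    coef = dot(a, w) / dot(w, w)
    return [max(0.0, x - coef * y) for x, y in zip(a, w)]

# Hypothetical non-negative patch activations (5 patches, 4 channels).
A = [
    [2.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
    [3.0, 3.0, 1.0, 1.0],
    [0.5, 0.5, 2.0, 2.0],
]
U, W = nmf(A, k=2)

# Hypothetical loss-reducing directions (negative loss gradients with respect
# to the activations) for misclassified samples; hand-written here.
fn_dirs = [[0.9, 1.1, 0.0, 0.1], [1.0, 0.8, 0.1, 0.0]]
fp_dirs = [[-1.0, -0.9, 0.1, 0.0]]

scores = bias_scores(W, fn_dirs, fp_dirs)
top = max(range(len(scores)), key=scores.__getitem__)
cleaned = suppress(A[0], W[top])  # activation with the top-ranked concept removed
print(scores, top, cleaned)
```

The suppression step only edits activations at inference time, which is consistent with the abstract's claim that no retraining or parameter updates are needed. For real activation matrices, a library NMF routine would be a more practical choice than the hand-rolled updates shown here.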
Problem

Research questions and friction points this paper is trying to address.

bias identification
spurious correlations
frozen vision models
post-hoc analysis
label-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient probes
concept decomposition
label-free bias identification
spurious correlation
post-hoc debiasing