🤖 AI Summary
Existing occlusion-based feature attribution methods rely on externally chosen baselines. These baselines can produce out-of-distribution inputs and unstable explanations, and in nonlinear models, occluding some features can also shift the attributions of the remaining ones. This work proposes XtrAIn, which moves the occlusion operation from input space to parameter space. Instead of perturbing inputs, it measures how feature-related parameter updates along the training trajectory affect the model's output logits. This design is intended to avoid the distributional shift and attribution drift introduced by input-space baselines. The work also introduces Xstep, a lightweight approximation that reduces computational cost, and XtrAIn+, a target-focused variant. On controlled image datasets and PAM50 breast-cancer subtype classification, XtrAIn produces cleaner and more interpretable attribution maps than standard attribution baselines.
📝 Abstract
Occlusion-based attribution methods provide an intuitive way to estimate feature importance by perturbing input features and measuring the resulting change in model output. However, their reliability depends strongly on how feature removal is implemented. Externally selected baselines can introduce bias, produce out-of-distribution samples, and yield unstable explanations, while in nonlinear models, occluding a set of features can also alter the contribution of the non-occluded features. We refer to this effect as attribution shift, since the attribution scores of the non-occluded features drift from their initial values. To address these sources of instability, we introduce XtrAIn, a training-guided attribution method that transfers the occlusion operation from the input space to the parameter space. Instead of replacing input values with hand-crafted baselines, XtrAIn follows the model's training trajectory and measures how feature-associated parameter updates affect the output logits. We further introduce Xstep, a lightweight approximation that reduces computational cost, and XtrAIn+, a target-focused variant that emphasizes updates aligned with the target class. Experiments on controlled image datasets and PAM50 breast-cancer subtype classification show that the proposed methods produce cleaner and more interpretable attribution patterns than standard attribution baselines. Overall, XtrAIn offers a training-aware perspective on feature attribution and a useful diagnostic tool for studying how feature-level evidence forms during training.
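The two input-space problems described above, baseline dependence and attribution shift, can be reproduced on a toy model. The sketch below shows plain input-space occlusion, not XtrAIn. The model f(x) = x0 * x1 + x2, the zero and all-ones baselines, and the helper names `occlude` and `occlusion_attribution` are hypothetical choices made for illustration only.

```python
# Minimal sketch of input-space occlusion (NOT the XtrAIn method).
# The toy model and baselines below are hypothetical illustration choices.

def model(x):
    # Nonlinear toy model: features 0 and 1 interact, feature 2 is additive.
    return x[0] * x[1] + x[2]

def occlude(x, features, baseline):
    """Return a copy of x with the given feature indices replaced by baseline values."""
    return [baseline[i] if i in features else v for i, v in enumerate(x)]

def occlusion_attribution(f, x, feature, baseline, already_occluded=()):
    """Occlusion score of `feature`: the drop in f when it is replaced by the
    baseline, given that `already_occluded` features were replaced first."""
    context = set(already_occluded)
    before = f(occlude(x, context, baseline))
    after = f(occlude(x, context | {feature}, baseline))
    return before - after

x = [2.0, 3.0, 1.0]
zeros = [0.0, 0.0, 0.0]
ones = [1.0, 1.0, 1.0]

print(occlusion_attribution(model, x, 0, zeros))                        # 6.0
print(occlusion_attribution(model, x, 0, zeros, already_occluded={1}))  # 0.0: attribution shift
print(occlusion_attribution(model, x, 2, zeros, already_occluded={1}))  # 1.0: linear term, no shift
print(occlusion_attribution(model, x, 0, ones))                         # 3.0: baseline dependence
```

Feature 0's score drops from 6.0 to 0.0 once its interaction partner is occluded. It also halves when the baseline changes from zeros to ones. The additive feature 2 keeps the same score in both contexts, which suggests that attribution shift is specific to nonlinear interactions.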