🤖 AI Summary
This work addresses the lack of mechanistic understanding of how robust image models are to global perturbations such as noise and adversarial attacks. To this end, we propose I-ASIDE: the first model-agnostic framework that fuses spectral analysis with axiomatic attribution to study global robustness. I-ASIDE models the degradation of the spectral signal-to-noise ratio (SNR) under perturbation and employs Shapley-value-based axiomatic attribution to quantify, for the first time, an intrinsic trade-off: low-frequency features predominantly govern robustness, while high-frequency features primarily sustain accuracy. Evaluated on ImageNet-pretrained models, I-ASIDE achieves high-fidelity robustness assessment (mCE correlation > 0.92) and delivers interpretable, feature-level attribution visualizations. Unlike prior approaches constrained to local perturbations or specific architectures, I-ASIDE establishes a unified, interpretable analytical paradigm for probing robustness mechanisms across diverse models.
📝 Abstract
Perturbation robustness measures a model's vulnerability to a variety of perturbations, such as data corruptions and adversarial attacks. Understanding the mechanisms behind perturbation robustness is critical for global interpretability. We present a model-agnostic, global mechanistic interpretability method to interpret the perturbation robustness of image models. This research is motivated by two key observations. First, previous global interpretability works, in tandem with robustness benchmarks, e.g., mean corruption error (mCE), are not designed to directly interpret the mechanisms of perturbation robustness within image models. Second, we notice that the spectral signal-to-noise ratios (SNR) of perturbed natural images decay exponentially with frequency. This power-law-like decay implies that low-frequency signals are generally more robust than high-frequency signals, yet high classification accuracy cannot be achieved by low-frequency signals alone. By applying Shapley value theory, our method axiomatically quantifies the predictive powers of robust features and non-robust features within an information theory framework. Our method, dubbed **I-ASIDE** (**I**mage **A**xiomatic **S**pectral **I**mportance **D**ecomposition **E**xplanation), provides a unique insight into model robustness mechanisms. We conduct extensive experiments over a variety of vision models pre-trained on ImageNet to show that **I-ASIDE** can not only **measure** the perturbation robustness but also **provide interpretations** of its mechanisms.
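The claim that spectral SNR decays with frequency can be illustrated with a small NumPy sketch (this is not the paper's implementation). We synthesize a toy image with an approximately 1/f power spectrum, which mimics the scale-invariant statistics of natural images, add white Gaussian noise, and compare the radially averaged signal and noise power per frequency band; the band count and noise level here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "natural" image: approximately 1/f amplitude spectrum with random
# phases, which mimics the scale-invariant statistics of natural images.
n = 128
fy = np.fft.fftfreq(n)[:, None]
fx = np.fft.fftfreq(n)[None, :]
radius = np.sqrt(fx**2 + fy**2)
radius[0, 0] = 1.0 / n  # avoid division by zero at the DC component

phase = rng.uniform(0.0, 2.0 * np.pi, (n, n))
image = np.real(np.fft.ifft2((1.0 / radius) * np.exp(1j * phase)))
image = (image - image.mean()) / image.std()

# Additive white Gaussian noise: flat (frequency-independent) power spectrum.
noise = rng.normal(0.0, 0.5, (n, n))

def radial_profile(power, radius, n_bins=16):
    """Average a 2-D power spectrum over annular frequency bins."""
    bins = np.linspace(0.0, radius.max(), n_bins + 1)
    idx = np.clip(np.digitize(radius.ravel(), bins) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return sums / counts

sig_power = np.abs(np.fft.fft2(image)) ** 2
noise_power = np.abs(np.fft.fft2(noise)) ** 2
snr = radial_profile(sig_power, radius) / radial_profile(noise_power, radius)

# The 1/f signal power falls with frequency while the noise power is flat,
# so the spectral SNR is far higher in the lowest band than in the highest.
print(snr[0], snr[-1])
```

Because the noise floor is flat while the signal power falls off with frequency, the per-band SNR drops by orders of magnitude from the lowest to the highest band, consistent with the abstract's observation that low-frequency signals are the more robust ones.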
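The Shapley-value step can likewise be sketched in miniature. The following is not I-ASIDE itself but a generic exact Shapley computation over three hypothetical spectral bands ("low", "mid", "high"); the characteristic function `scores`, standing in for a model's predictive power on each coalition of bands, uses made-up illustrative numbers:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values:
    phi_i = sum over S subset of N\\{i} of |S|!(n-|S|-1)!/n! * (v(S+{i}) - v(S))."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                coalition = frozenset(S)
                phi[p] += weight * (value(coalition | {p}) - value(coalition))
    return phi

# Hypothetical predictive scores per coalition of spectral bands
# (illustrative numbers only, not results from the paper).
scores = {
    frozenset(): 0.0,
    frozenset({"low"}): 0.4,
    frozenset({"mid"}): 0.3,
    frozenset({"high"}): 0.2,
    frozenset({"low", "mid"}): 0.7,
    frozenset({"low", "high"}): 0.7,
    frozenset({"mid", "high"}): 0.6,
    frozenset({"low", "mid", "high"}): 0.9,
}

phi = shapley_values(["low", "mid", "high"], lambda S: scores[frozenset(S)])
print(phi)
```

The axioms the abstract invokes are visible here: efficiency guarantees the attributions sum to the grand coalition's score (0.9), so the predictive power is fully decomposed across the bands, mirroring how I-ASIDE distributes a model's predictive power across robust and non-robust spectral features.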