AI Summary
This study investigates whether the classification capability and interpretability of pretrained vision models can be jointly enhanced. To address the unclear relationship between these two properties, we propose the Inherent Interpretability Score (IIS), a metric that quantifies the proportion of interpretable semantics in model representations via explanation methods, and we establish, both theoretically and empirically, a significant positive correlation between IIS and classification accuracy. Building on this insight, we design a fine-grained, explanation-guided fine-tuning paradigm that directly optimizes IIS to drive interpretability-aware representation learning. Extensive evaluation across multiple vision tasks demonstrates that our method not only improves classification accuracy but also substantially reduces the performance degradation induced by post-hoc explanations. Our core contributions are twofold: (1) the first quantitative characterization of a positive correlation between interpretability and classification capability in vision representations; and (2) the first training paradigm that explicitly maximizes interpretability as an optimization objective while concurrently enhancing downstream classification performance.
Abstract
The visual representations of pre-trained models prioritize classifiability on downstream tasks, while the widespread application of pre-trained vision models has posed new requirements for representation interpretability. However, it remains unclear whether pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS), which evaluates this information loss, measures the ratio of interpretable semantics, and thereby quantifies representation interpretability. Evaluating the interpretability of representations with different classifiability, we surprisingly discover that interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured by interpretations. This observation yields two further benefits for pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with this classifiability improvement, predictions based on the interpretations of the representations suffer less accuracy degradation. The discovered positive correlation and the corresponding applications show that practitioners can unify improvements in interpretability and classifiability for pre-trained vision models. Code is available at https://github.com/ssfgunner/IIS.
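As a rough intuition for the score described above, one can think of IIS as one minus the normalized information loss between a representation and the part of it recovered by an interpretation. The sketch below is a minimal NumPy proxy under that assumption; the function name `iis_proxy` and the squared-error loss measure are illustrative choices for exposition, not the paper's exact formulation.

```python
import numpy as np

def iis_proxy(representation, interpretation):
    """Hypothetical IIS-style proxy: fraction of the representation's
    energy that is captured by the interpretation.

    Returns a value near 1.0 when the interpretation recovers the
    representation almost fully (high interpretability), and near 0.0
    when most of the representation is left unexplained.
    """
    rep = np.asarray(representation, dtype=float)
    interp = np.asarray(interpretation, dtype=float)
    # The residual is the uninterpretable part of the representation,
    # i.e., the information lost when relying on the interpretation.
    residual = rep - interp
    loss = np.sum(residual ** 2) / np.sum(rep ** 2)
    return 1.0 - loss

# A perfectly recovered representation scores 1.0; a completely
# unexplained one scores 0.0.
print(iis_proxy([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(iis_proxy([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```

In this toy form, the score could in principle be maximized during fine-tuning, which mirrors the interpretability-maximization objective the abstract describes at a high level.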