AI Summary
This study investigates whether the classification capability and interpretability of pretrained vision models can be jointly enhanced. To address the unclear relationship between these two properties, we propose the Inherent Interpretability Score (IIS), a metric that quantifies the proportion of interpretable semantics in model representations via explanation methods, and we establish, both theoretically and empirically, a significant positive correlation between IIS and classification accuracy. Building on this insight, we design a fine-grained, explanation-guided fine-tuning paradigm that directly optimizes IIS to drive interpretability-aware representation learning. Extensive evaluation across multiple vision tasks demonstrates that our method not only improves classification accuracy but also substantially reduces the performance degradation induced by post-hoc explanations. Our core contributions are twofold: (1) the first quantitative characterization of a positive correlation between interpretability and classification capability in vision representations; and (2) the first training paradigm that explicitly maximizes interpretability as an optimization objective while concurrently enhancing downstream classification performance.
Abstract
The visual representations of pre-trained models prioritize classifiability on downstream tasks, while the widespread application of pre-trained vision models has posed new requirements for representation interpretability. However, it remains unclear whether pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS), which evaluates this information loss, measures the ratio of interpretable semantics, and thereby quantifies representation interpretability. Evaluating the interpretability of representations with different classifiability, we surprisingly discover that interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured by interpretations. This observation yields two further benefits for pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with this classifiability improvement, predictions based on the interpretations of the representations suffer less accuracy degradation. The discovered positive correlation and the corresponding applications show that practitioners can unify improvements in interpretability and classifiability for pre-trained vision models. Code is available at https://github.com/ssfgunner/IIS.
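As a rough intuition for the score described above, one can think of IIS as one minus the normalized information loss between a representation and the part of it recovered by an interpretation. The sketch below is a minimal NumPy proxy under that assumption; the function name `iis_proxy` and the squared-error loss measure are illustrative choices for exposition, not the paper's exact formulation.

```python
import numpy as np

def iis_proxy(representation, interpretation):
    """Hypothetical IIS-style proxy: fraction of the representation's
    energy that is captured by the interpretation.

    Returns a value near 1.0 when the interpretation recovers the
    representation almost fully (high interpretability), and near 0.0
    when most of the representation is left unexplained.
    """
    rep = np.asarray(representation, dtype=float)
    interp = np.asarray(interpretation, dtype=float)
    # The residual is the uninterpretable part of the representation,
    # i.e., the information lost when relying on the interpretation.
    residual = rep - interp
    loss = np.sum(residual ** 2) / np.sum(rep ** 2)
    return 1.0 - loss

# A perfectly recovered representation scores 1.0; a completely
# unexplained one scores 0.0.
print(iis_proxy([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(iis_proxy([1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```

In this toy form, the score could in principle be maximized during fine-tuning, which mirrors the interpretability-maximization objective the abstract describes at a high level.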