🤖 AI Summary
This work investigates whether Concept Bottleneck Models (CBMs) satisfy the locality assumption—that concept predictions depend solely on features genuinely relevant to each concept, rather than spurious, statistically confounded features.
Method: We systematically analyse how CBM concept predictions change under input perturbations, combining these causally inspired concept-attribution diagnostics with theoretical modeling and empirical evaluation across multiple benchmarks.
Contribution/Results: We show that CBMs can fail to respect locality even under favourable conditions where concepts are independent and localised to non-overlapping feature subsets: inter-concept correlations in the data can yield accurate but uninterpretable concept predictors. Empirically, CBMs sometimes exploit spurious features to achieve high concept accuracy, making their concept explanations unreliable. These findings indicate that CBM interpretability is fragile and motivate further work on diagnosing and improving the robustness of concept predictors.
📝 Abstract
Concept-based methods explain model predictions using human-understandable concepts. These models require accurate concept predictors, yet the faithfulness of existing concept predictors to their underlying concepts is unclear. In this paper, we investigate the faithfulness of Concept Bottleneck Models (CBMs), a popular family of concept-based architectures, by looking at whether they respect "localities" in datasets. Localities involve using only relevant features when predicting a concept's value. When localities are not considered, concepts may be predicted based on spuriously correlated features, degrading performance and robustness. This work examines how CBM predictions change when perturbing model inputs, and reveals that CBMs may not capture localities, even when independent concepts are localised to non-overlapping feature subsets. Our empirical and theoretical results demonstrate that datasets with correlated concepts may lead to accurate but uninterpretable models that fail to learn localities. Overall, we find that CBM interpretability is fragile, as CBMs occasionally rely upon spurious features, necessitating further research into the robustness of concept predictors.
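To make the locality idea concrete, below is a minimal sketch (our illustration, not the authors' released code) of a toy PyTorch CBM and a perturbation-based locality probe: it perturbs only the features assumed irrelevant to a chosen concept and measures how much that concept's predicted probability changes. Under the locality assumption the change should be near zero; large shifts suggest reliance on spurious features. All names (`ToyCBM`, `locality_probe`), layer sizes, the Gaussian noise model, and the choice of "irrelevant" feature indices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyCBM(nn.Module):
    """Inputs -> concept logits -> label logits (the concept 'bottleneck')."""
    def __init__(self, n_features: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.concept_predictor = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_concepts)
        )
        self.label_predictor = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        c_logits = self.concept_predictor(x)                       # predicted concepts
        y_logits = self.label_predictor(torch.sigmoid(c_logits))   # label depends only on concepts
        return c_logits, y_logits

def locality_probe(model, x, concept_idx, irrelevant_idx, noise_scale=1.0, n_samples=100):
    """Perturb only the features assumed irrelevant to `concept_idx` and report the
    mean absolute change in that concept's predicted probability. Under locality,
    this should be ~0; large values suggest the concept predictor uses spurious,
    out-of-region features."""
    model.eval()
    with torch.no_grad():
        base = torch.sigmoid(model(x)[0])[:, concept_idx]
        shifts = []
        for _ in range(n_samples):
            x_pert = x.clone()
            noise = noise_scale * torch.randn(x.shape[0], len(irrelevant_idx))
            x_pert[:, irrelevant_idx] += noise
            pert = torch.sigmoid(model(x_pert)[0])[:, concept_idx]
            shifts.append((pert - base).abs().mean())
        return torch.stack(shifts).mean()

# Toy usage: features 5-9 are assumed irrelevant to concept 0 in this made-up setup.
model = ToyCBM(n_features=10, n_concepts=3, n_classes=2)
x = torch.randn(256, 10)
print(locality_probe(model, x, concept_idx=0, irrelevant_idx=[5, 6, 7, 8, 9]))
```

A probe value well above zero for a trained model would indicate the kind of locality violation the paper studies, where a concept's prediction shifts even though only features outside its assumed region were perturbed.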