FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the overly strong assumptions made by existing concept-based explanations of deep networks, such as class specificity, small spatial extent, and alignment with human priors. The authors propose an intrinsic, model-inherent concept explanation framework with a learnable concept mechanism that faithfully extracts and quantifies concepts shared across classes, supporting concept-to-logit contribution analysis and input visualization at arbitrary network layers. A key contribution is the C²-Score, an unsupervised, scalable metric grounded in foundation models that enables quantitative evaluation of concept consistency without ground-truth supervision. On ImageNet, the method retains competitive classification performance while producing quantitatively more consistent concepts, and user studies find the extracted concepts more interpretable and comprehensible than those of mainstream post-hoc explanation methods.
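The summary above does not spell out how the C²-Score is computed, so the following is only a minimal sketch of one way a foundation-model-based concept-consistency score could look: represent each concept by the image crops that activate it most strongly, embed those crops with a frozen foundation model, and score the concept by the mean pairwise cosine similarity of the embeddings. The crop-based summarization, the embedding model, and the pairwise-similarity formulation are assumptions for illustration, not the paper's exact definition.

```python
# Minimal sketch (assumptions noted above): concept consistency as mean pairwise
# cosine similarity of foundation-model embeddings of a concept's top-activating crops.
import torch
import torch.nn.functional as F


def concept_consistency_score(crop_embeddings: torch.Tensor) -> float:
    """crop_embeddings: (n, d) embeddings of one concept's top-activating image crops."""
    z = F.normalize(crop_embeddings, dim=-1)      # unit-norm embeddings
    sim = z @ z.T                                 # (n, n) cosine-similarity matrix
    n = z.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()   # drop self-similarities on the diagonal
    return (off_diag / (n * (n - 1))).item()      # mean over all ordered pairs


# Toy usage: random features stand in for real foundation-model embeddings.
if __name__ == "__main__":
    fake_embeddings = torch.randn(8, 384)         # e.g., 8 crops with 384-dim features
    print(f"concept consistency: {concept_consistency_score(fake_embeddings):.3f}")
```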

📝 Abstract
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to understand their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions on the concepts a model learns, such as class-specificity, small spatial extent, or alignment to human expectations. In this work, we put emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and users find our concepts to be more interpretable, all while retaining competitive ImageNet performance.
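As a concrete reading of the claim that each concept's "contribution to the logit" can be traced from any layer, here is a hedged sketch under simplifying assumptions of my own: a layer's feature vector is approximated as a linear combination of a shared concept dictionary, and the path from that layer to the logits is a linear head, so each concept's contribution to a class logit is its coefficient times the head applied to that concept direction. FaCT's actual mechanism may differ; the dictionary, the least-squares decomposition, and the linear head below are illustrative placeholders.

```python
# Hedged sketch of concept-to-logit contribution tracing (assumptions noted above).
import torch

d, k, num_classes = 512, 32, 10                      # feature dim, #concepts, #classes
concepts = torch.randn(k, d)                         # shared concept dictionary, one row per concept
head = torch.nn.Linear(d, num_classes, bias=False)   # assumed linear readout to the logits

feature = torch.randn(d)                             # one sample's feature at the chosen layer

# Solve for coefficients so that feature ≈ coeffs @ concepts (least squares).
coeffs = torch.linalg.lstsq(concepts.T, feature.unsqueeze(1)).solution.squeeze(1)  # (k,)

# Each concept's contribution to every class logit: coefficient times head(concept direction).
contributions = coeffs.unsqueeze(1) * head(concepts)  # (k, num_classes)

# Summing over concepts recovers the logits of the reconstructed feature.
logits_from_concepts = contributions.sum(dim=0)       # ≈ head(feature) when reconstruction is good
print(contributions.shape, logits_from_concepts.shape)
```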
Problem

Research questions and friction points this paper is trying to address.

How to explain neural network decisions with concept traces that are faithful to the model
Existing post-hoc concept-based explanations can be unfaithful and rely on restrictive assumptions (class specificity, small spatial extent, alignment with human expectations)
Lack of model-inherent concepts that are shared across classes and whose contributions to the logits can be traced
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-inherent, mechanistic concept explanations that are faithful by construction
Cross-class shared concepts with traceable contributions to the logits
C²-Score: a foundation-model-based metric for evaluating concept consistency