The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective

📅 2022-02-03
🏛️ Trans. Mach. Learn. Res.
📈 Citations: 173
Influential: 27
🤖 AI Summary
This paper studies the disagreement problem in post hoc interpretability: different explanation methods can assign contradictory feature attributions to the same model prediction. The authors formalize what it means for explanations to disagree, grounding the definition in in-depth interviews with 24 data scientists, and introduce a quantitative framework for measuring such disagreement. An empirical study across 4 real-world datasets, 6 predictive models, and 6 post hoc explanation methods shows that state-of-the-art methods disagree frequently and that there are no principled criteria for resolving these conflicts. A complementary online user study finds that practitioners fall back on ad hoc heuristics (e.g., favoring a preferred method or the more visually intuitive output) that are neither reliable nor reproducible. The results suggest that practitioners may be acting on misleading explanations in high-stakes settings, and they motivate principled frameworks for evaluating and comparing explanation methods.
📝 Abstract
As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of if and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements. We first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and six different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that (1) state-of-the-art explanation methods often disagree in terms of the explanations they output, and (2) machine learning practitioners often employ ad hoc heuristics when resolving such disagreements. These findings suggest that practitioners may be relying on misleading explanations when making consequential decisions. They also underscore the importance of developing principled frameworks for effectively evaluating and comparing explanations output by various explanation techniques.
Problem

Research questions and friction points this paper is trying to address.

Understanding disagreement between post hoc explanation methods in machine learning.
Analyzing frequency and resolution of explanation disagreements in practice.
Developing frameworks to evaluate and compare explanation methods effectively.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a quantitative framework for measuring disagreement between explanations (see the sketch after this list)
Analyzes disagreement frequency among explanation methods
Studies practitioner heuristics for resolving explanation conflicts
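
As a rough illustration of what such a framework can look like, below is a minimal sketch of top-k agreement metrics between two feature-attribution explanations for the same prediction. The function names, exact metric definitions, and toy attribution vectors are assumptions made for illustration, not the paper's reference implementation.

```python
import numpy as np

def top_k(attr, k):
    """Indices of the k features with the largest absolute attribution."""
    return np.argsort(-np.abs(attr))[:k]

def feature_agreement(a, b, k):
    """Fraction of top-k features that the two explanations share."""
    return len(set(top_k(a, k)) & set(top_k(b, k))) / k

def rank_agreement(a, b, k):
    """Fraction of top-k positions where both explanations rank the same feature."""
    return float(np.mean(top_k(a, k) == top_k(b, k)))

def sign_agreement(a, b, k):
    """Fraction of shared top-k features whose attribution signs also match."""
    shared = set(top_k(a, k)) & set(top_k(b, k))
    return sum(int(np.sign(a[i]) == np.sign(b[i])) for i in shared) / k

# Hypothetical attributions from two explanation methods for one prediction.
attr_method_1 = np.array([0.40, -0.10, 0.25, 0.05, -0.30])
attr_method_2 = np.array([0.35, 0.20, -0.15, 0.05, -0.25])

k = 3
print(feature_agreement(attr_method_1, attr_method_2, k))  # ~0.67: top-3 sets share 2 of 3 features
print(rank_agreement(attr_method_1, attr_method_2, k))     # ~0.67: those 2 features also keep their ranks
print(sign_agreement(attr_method_1, attr_method_2, k))     # ~0.67: the 2 shared features agree in sign
```

Values near 1 indicate that the two methods largely agree on which features matter (and how), while values near 0 signal the kind of disagreement the paper documents between popular explanation methods.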