Evaluating Input Feature Explanations through a Unified Diagnostic Evaluation Framework

📅 2024-06-21
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the lack of unified, cross-type evaluation for input feature attribution methods. It proposes an automated framework for directly comparing token-level, token-interaction-level, and span-interaction-level explanation methods. Grounded in four diagnostic properties (faithfulness, stability, interpretability, and robustness), the framework integrates diverse techniques including Shapley Values, Integrated Gradients, Bivariate Shapley, attention-based methods, and Louvain Span Interactions. It is validated on two benchmark datasets (SST-2 and BoolQ) and two models (BERT and RoBERTa). Results show that span-interaction explanations, despite being relatively understudied, outperform conventional approaches on most diagnostic properties, and that the three explanation types have complementary strengths. The study thus provides a reproducible, principled benchmark for selecting and improving attribution methods.
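Of the diagnostic properties named above, faithfulness is commonly operationalized by deleting the most highly attributed tokens and measuring the drop in the model's score. A minimal comprehensiveness-style sketch of that idea (the toy scorer, token list, and attribution values below are illustrative inventions, not the paper's actual setup):

```python
import numpy as np

def comprehensiveness(predict, tokens, attributions, k):
    """Faithfulness-style check (illustrative, not the paper's exact metric):
    remove the k highest-attributed tokens and measure the drop in the
    model's score. A larger drop suggests a more faithful explanation."""
    top_k = set(np.argsort(attributions)[-k:])
    reduced = [t for i, t in enumerate(tokens) if i not in top_k]
    return predict(tokens) - predict(reduced)

# Toy scorer: fraction of sentiment-bearing words (stands in for a classifier).
POSITIVE = {"great", "excellent"}
predict = lambda toks: sum(t in POSITIVE for t in toks) / max(len(toks), 1)

tokens = ["a", "great", "and", "excellent", "film"]
attrs = np.array([0.0, 0.9, 0.1, 0.8, 0.2])
drop = comprehensiveness(predict, tokens, attrs, k=2)  # score drop after masking
```

Here the two highest-attributed tokens are exactly the sentiment-bearing ones, so removing them erases the toy model's positive score, i.e., the explanation is maximally faithful under this check.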

📝 Abstract
Explaining the decision-making process of machine learning models is crucial for ensuring their reliability and transparency for end users. One popular explanation form highlights key input features, such as i) tokens (e.g., Shapley Values and Integrated Gradients), ii) interactions between tokens (e.g., Bivariate Shapley and attention-based methods), or iii) interactions between spans of the input (e.g., Louvain Span Interactions). However, these explanation types have only been studied in isolation, making it difficult to judge their respective applicability. To bridge this gap, we develop a unified framework comprising four diagnostic properties that enables an automated, direct comparison between highlight and interactive explanations. We conduct an extensive analysis across these three types of input feature explanations, each instantiated with three different explanation techniques, across two datasets and two models, and find that each explanation type has distinct strengths across the diagnostic properties. Nevertheless, interactive span explanations outperform the other types of input feature explanations on most diagnostic properties. Although interactive span explanations remain relatively understudied, our analysis underscores the need for further research to improve the methods that generate them. Additionally, integrating them with other explanation types that perform better on certain properties could further enhance their overall effectiveness.
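Among the token-level techniques listed above, Integrated Gradients attributes a prediction to each input feature by accumulating gradients along a straight-line path from a baseline to the input. A minimal numerical sketch on a toy linear scorer (the weights and inputs are illustrative; real use would differentiate through a neural network rather than supply the gradient by hand):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Approximate Integrated Gradients attributions for input x.

    Attribution_i = (x_i - baseline_i) * average gradient of the model
    along the straight-line path from baseline to x (midpoint Riemann sum).
    """
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy "model": a fixed linear scorer standing in for a classifier logit,
# so the exact attributions are known in closed form (w_i * x_i here).
w = np.array([0.5, -1.0, 2.0])
f = lambda x: float(w @ x)
grad_f = lambda x: w  # gradient of a linear model is constant

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(grad_f, x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline).
gap = attr.sum() - (f(x) - f(baseline))
```

For a linear model the path integral is exact, so the attributions recover w * x and the completeness gap is zero; for a neural network the Riemann sum only approximates the integral, and `steps` trades accuracy for compute.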
Problem

Research questions and friction points this paper is trying to address.

How to compare different input feature explanation types within a single, automated framework
How to evaluate token-level, token-interaction, and span-interaction explanations on common diagnostic properties
Whether understudied interactive span explanations can outperform established explanation types
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for explanation comparison
Analysis of interactive span explanations
Integration of diverse explanation techniques