🤖 AI Summary
This study provides the first empirical evidence that post-hoc feature attribution methods exhibit intrinsic gender bias: their explanation quality, measured by faithfulness, robustness, and complexity, differs significantly across gender subgroups (average gap: 18.7%), and the disparity persists even after fine-tuning on debiased datasets, so it cannot be attributed to training-data bias alone. The evaluation spans three NLP tasks and five mainstream language models, using quantitative metrics including Infidelity, ROAR, and Complexity Score to establish a cross-model assessment framework. The core contribution is to formalize "explanation fairness" as a third foundational pillar alongside model fairness and interpretability, and to advocate its integration into AI regulatory frameworks. The results indicate that the explanation mechanism itself, not merely the underlying model or data, is a primary source of bias, with critical implications for algorithmic governance in high-stakes applications.
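As an illustrative sketch only (not the paper's implementation), the Infidelity metric mentioned above scores how well an attribution vector predicts the change in a model's output under small random input perturbations: a faithful attribution yields a score near zero. The toy linear model, variable names, and sampling parameters below are all hypothetical:

```python
import numpy as np

def infidelity(f, x, attribution, n_samples=1000, scale=0.1, seed=0):
    """Monte-Carlo estimate of Infidelity:
    E_I[(I . attribution - (f(x) - f(x - I)))^2] over random perturbations I."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_samples):
        I = rng.normal(0.0, scale, size=x.shape)  # random perturbation
        surrogate = I @ attribution               # effect predicted by the attribution
        actual = f(x) - f(x - I)                  # actual change in model output
        errs.append((surrogate - actual) ** 2)
    return float(np.mean(errs))

# Toy linear model: its gradient (the weights) is an exactly faithful attribution.
w = np.array([0.5, -1.0, 2.0])
f = lambda x: float(w @ x)
x = np.array([1.0, 2.0, -1.0])

good_attr = w                          # faithful: infidelity is ~0
bad_attr = np.array([1.0, 1.0, 1.0])   # unfaithful: infidelity is clearly positive

print(infidelity(f, x, good_attr))
print(infidelity(f, x, bad_attr))
```

Comparing such scores between subgroups (e.g., average infidelity on male-referring vs. female-referring inputs) is one simple way to quantify the kind of explanation disparity the study reports.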
📝 Abstract
While research on applications and evaluations of explanation methods continues to expand, the fairness of explanation methods, in the sense of disparities in their performance across subgroups, remains largely overlooked. In this paper, we address this gap by showing that, across three tasks and five language models, widely used post-hoc feature attribution methods exhibit significant gender disparities in their faithfulness, robustness, and complexity. These disparities persist even when the models are pre-trained or fine-tuned on deliberately debiased datasets, indicating that the disparities we observe are not merely consequences of biased training data. Our results highlight the importance of addressing disparities in explanations when developing and applying explainability methods, since such disparities can lead to biased outcomes against certain subgroups, with particularly critical implications in high-stakes contexts. Furthermore, our findings underscore the importance of incorporating the fairness of explanations, alongside overall model fairness and explainability, as a requirement in regulatory frameworks.