🤖 AI Summary
This study provides the first empirical evidence that post-hoc feature attribution methods exhibit intrinsic gender bias: their explanation quality, measured by faithfulness, robustness, and complexity, differs significantly across gender subgroups (average gap: 18.7%), and the disparity persists even after fine-tuning on debiased datasets, so it cannot be attributed to training-data bias alone. The evaluation spans three NLP tasks and five mainstream language models, using quantitative metrics including Infidelity, ROAR, and Complexity Score to establish a cross-model assessment framework. The core contribution is to formalize "explanation fairness" as a third foundational pillar alongside model fairness and interpretability, and to advocate its integration into AI regulatory frameworks. The results indicate that the explanation mechanism itself, not merely the underlying model or data, is a primary source of bias, with critical implications for algorithmic governance in high-stakes applications.
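As an illustrative sketch only (not the paper's implementation), the Infidelity metric mentioned above scores how well an attribution vector predicts the change in a model's output under small random input perturbations: a faithful attribution yields a score near zero. The toy linear model, variable names, and sampling parameters below are all hypothetical:

```python
import numpy as np

def infidelity(f, x, attribution, n_samples=1000, scale=0.1, seed=0):
    """Monte-Carlo estimate of Infidelity:
    E_I[(I . attribution - (f(x) - f(x - I)))^2] over random perturbations I."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_samples):
        I = rng.normal(0.0, scale, size=x.shape)  # random perturbation
        surrogate = I @ attribution               # effect predicted by the attribution
        actual = f(x) - f(x - I)                  # actual change in model output
        errs.append((surrogate - actual) ** 2)
    return float(np.mean(errs))

# Toy linear model: its gradient (the weights) is an exactly faithful attribution.
w = np.array([0.5, -1.0, 2.0])
f = lambda x: float(w @ x)
x = np.array([1.0, 2.0, -1.0])

good_attr = w                          # faithful: infidelity is ~0
bad_attr = np.array([1.0, 1.0, 1.0])   # unfaithful: infidelity is clearly positive

print(infidelity(f, x, good_attr))
print(infidelity(f, x, bad_attr))
```

Comparing such scores between subgroups (e.g., average infidelity on male-referring vs. female-referring inputs) is one simple way to quantify the kind of explanation disparity the study reports.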
📝 Abstract
While research on applications and evaluations of explanation methods continues to expand, the fairness of explanation methods, in the sense of disparities in their performance across subgroups, remains largely overlooked. In this paper, we address this gap by showing that, across three tasks and five language models, widely used post-hoc feature attribution methods exhibit significant gender disparities in their faithfulness, robustness, and complexity. These disparities persist even when the models are pre-trained or fine-tuned on deliberately debiased datasets, indicating that the disparities we observe are not merely consequences of biased training data. Our results highlight the importance of addressing disparities in explanations when developing and applying explainability methods, since such disparities can lead to biased outcomes against certain subgroups, with particularly critical implications in high-stakes contexts. Furthermore, our findings underscore the importance of incorporating the fairness of explanations, alongside overall model fairness and explainability, as a requirement in regulatory frameworks.