🤖 AI Summary
Existing XAI methods struggle to accommodate the non-numerical nature of large language models (LLMs), limiting their usefulness for bias detection. To address this, we propose a text-to-ordinal mapping strategy that transforms discrete textual inputs and outputs into quantifiable numerical representations, and we systematically inject diverse nonlinear, multivariate biases into LLMs to construct a rigorous benchmarking framework. Building on this, we introduce RuleSHAP, a novel algorithm that combines SHAP's feature attributions with RuleFit's rule-based interpretability, enabling the detection of conjunctive and non-convex bias patterns. Experiments show that RuleSHAP improves bias localization over RuleFit by 94% on average in MRR@1, yielding accurate, human-interpretable detection of complex nonlinear biases in LLMs. This work establishes a foundation for auditable, explainable bias assessment in LLMs.
📝 Abstract
Generative AI systems can help spread information but also misinformation and biases, potentially undermining the UN Sustainable Development Goals (SDGs). Explainable AI (XAI) aims to reveal the inner workings of AI systems and expose misbehaviours or biases. However, current XAI tools, built for simpler models, struggle to handle the non-numerical nature of large language models (LLMs). This paper examines the effectiveness of global XAI methods, such as rule-extraction algorithms and SHAP, in detecting bias in LLMs. To do so, we first present a text-to-ordinal mapping strategy that converts non-numerical inputs/outputs into numerical features, enabling these tools to identify (some) misinformation-related biases in LLM-generated content. Then, we inject non-linear biases of varying complexity (univariate, conjunctive, and non-convex) into widespread LLMs like ChatGPT and Llama via system instructions, using global XAI methods to detect them. In this way, we find that RuleFit struggles with conjunctive and non-convex biases, while SHAP can approximate conjunctive biases but cannot express them as actionable rules. Hence, we introduce RuleSHAP, a global rule extraction algorithm combining SHAP and RuleFit to detect more non-univariate biases, improving injected bias detection over RuleFit by +94% (MRR@1) on average.
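To make the text-to-ordinal idea concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it maps an LLM's free-text stance answers onto an ordinal scale so that numerical XAI tools such as SHAP or RuleFit can consume them. The agreement-scale labels, the keyword matching, and the neutral fallback are all assumptions for illustration.

```python
# Hypothetical ordinal scale for stance-style LLM answers (an assumption,
# not the paper's exact scale).
ORDINAL_SCALE = {
    "strongly disagree": 0,
    "disagree": 1,
    "neutral": 2,
    "agree": 3,
    "strongly agree": 4,
}

def text_to_ordinal(answer: str) -> int:
    """Map a textual LLM answer to an ordinal score via keyword matching.

    Longer labels are checked first so that 'strongly agree' is not
    shadowed by the substring 'agree'. Answers matching no label fall
    back to the 'neutral' score.
    """
    text = answer.lower()
    for label in sorted(ORDINAL_SCALE, key=len, reverse=True):
        if label in text:
            return ORDINAL_SCALE[label]
    return ORDINAL_SCALE["neutral"]
```

Once responses are numeric in this way, off-the-shelf attribution and rule-extraction tools can be run over batches of prompt features and scored outputs to surface bias patterns.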