🤖 AI Summary
Existing XAI methods struggle to accommodate the non-numerical nature of large language models (LLMs), limiting their usefulness for bias detection. To address this, we propose a text-to-ordinal mapping strategy that transforms discrete textual inputs and outputs into quantifiable numerical representations, and we systematically inject diverse nonlinear, multivariate biases into LLMs to construct a rigorous benchmarking framework. Building on this, we introduce RuleSHAP, a novel algorithm that combines SHAP's feature attributions with RuleFit's rule-based interpretability, enabling the detection of conjunctive and non-convex bias patterns. Experiments show that RuleSHAP improves bias localization over RuleFit by 94% on average in MRR@1, yielding accurate, human-interpretable detection of complex nonlinear biases in LLMs. This work establishes a foundation for auditable, explainable bias assessment in LLMs.
📝 Abstract
Generative AI systems can help spread information but also misinformation and biases, potentially undermining the UN Sustainable Development Goals (SDGs). Explainable AI (XAI) aims to reveal the inner workings of AI systems and expose misbehaviours or biases. However, current XAI tools, built for simpler models, struggle to handle the non-numerical nature of large language models (LLMs). This paper examines the effectiveness of global XAI methods, such as rule-extraction algorithms and SHAP, in detecting bias in LLMs. To do so, we first present a text-to-ordinal mapping strategy that converts non-numerical inputs/outputs into numerical features, enabling these tools to identify (some) misinformation-related biases in LLM-generated content. Then, we inject non-linear biases of varying complexity (univariate, conjunctive, and non-convex) into widespread LLMs like ChatGPT and Llama via system instructions, using global XAI methods to detect them. In this way, we find that RuleFit struggles with conjunctive and non-convex biases, while SHAP can approximate conjunctive biases but cannot express them as actionable rules. Hence, we introduce RuleSHAP, a global rule extraction algorithm combining SHAP and RuleFit to detect more non-univariate biases, improving injected bias detection over RuleFit by +94% (MRR@1) on average.
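To make the text-to-ordinal idea concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it maps an LLM's free-text stance answers onto an ordinal scale so that numerical XAI tools such as SHAP or RuleFit can consume them. The agreement-scale labels, the keyword matching, and the neutral fallback are all assumptions for illustration.

```python
# Hypothetical ordinal scale for stance-style LLM answers (an assumption,
# not the paper's exact scale).
ORDINAL_SCALE = {
    "strongly disagree": 0,
    "disagree": 1,
    "neutral": 2,
    "agree": 3,
    "strongly agree": 4,
}

def text_to_ordinal(answer: str) -> int:
    """Map a textual LLM answer to an ordinal score via keyword matching.

    Longer labels are checked first so that 'strongly agree' is not
    shadowed by the substring 'agree'. Answers matching no label fall
    back to the 'neutral' score.
    """
    text = answer.lower()
    for label in sorted(ORDINAL_SCALE, key=len, reverse=True):
        if label in text:
            return ORDINAL_SCALE[label]
    return ORDINAL_SCALE["neutral"]
```

Once responses are numeric in this way, off-the-shelf attribution and rule-extraction tools can be run over batches of prompt features and scored outputs to surface bias patterns.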