Towards Robust and Accurate Stability Estimation of Local Surrogate Models in Text-based Explainable AI

📅 2025-01-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inaccuracy of stability assessment for local surrogate models (e.g., LIME) in text-based explainable AI (XAI) under adversarial perturbations. We identify a fundamental flaw in prevailing surface-similarity metrics: their neglect of semantic equivalence renders them overly sensitive and prone to erroneous stability judgments. To remedy this, we propose the first weighted similarity framework that explicitly incorporates lexical semantic similarity—modeling synonymy—into XAI stability evaluation. Our method integrates pre-trained word embeddings for semantic representation with ranked-list similarity computation. Evaluated across multiple NLP benchmark datasets, it reduces average stability estimation error by 42% compared to conventional metrics. This yields significantly more faithful and reliable assessments of adversarial robustness in explanation fidelity. The approach thus establishes a more trustworthy foundation for deploying XAI systems in high-stakes domains such as legal reasoning and decision support.
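The weighted framework described above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the toy embedding vectors, the greedy feature matching, and the rank-displacement discount are all assumptions made for the example; a real implementation would use pre-trained embeddings such as GloVe or word2vec.

```python
import numpy as np

# Hypothetical toy word embeddings (stand-ins for pre-trained vectors).
EMB = {
    "doctor":    np.array([0.90, 0.10, 0.00]),
    "physician": np.array([0.85, 0.15, 0.05]),
    "drug":      np.array([0.10, 0.90, 0.20]),
    "medicine":  np.array([0.12, 0.88, 0.25]),
    "court":     np.array([0.00, 0.20, 0.95]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_weighted_similarity(expl_a, expl_b):
    """Compare two ranked feature lists, crediting synonym pairs.

    Each feature in expl_a is greedily matched to its most semantically
    similar unused feature in expl_b; the matched pair's cosine similarity
    is discounted by how far apart the two features rank.
    """
    total, used = 0.0, set()
    for i, fa in enumerate(expl_a):
        best_j, best_sim = None, 0.0
        for j, fb in enumerate(expl_b):
            if j in used:
                continue
            s = cosine(EMB[fa], EMB[fb])
            if s > best_sim:
                best_j, best_sim = j, s
        if best_j is not None:
            used.add(best_j)
            # Rank-displacement discount: identical ranks keep full credit.
            total += best_sim / (1 + abs(i - best_j))
    return total / len(expl_a)
```

Under these toy embeddings, comparing `["doctor", "drug"]` with `["physician", "medicine"]` yields a score close to 1.0 despite zero lexical overlap, whereas an exact-match measure would report complete disagreement.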

📝 Abstract
Recent work has investigated adversarial attacks on explainable AI (XAI) in the NLP domain, focusing on the vulnerability of local surrogate methods such as LIME to adversarial perturbations, i.e., small changes to the input of a machine learning (ML) model. In such attacks, the generated explanation is manipulated while the meaning and structure of the original input remain similar under the ML model. Such attacks are especially alarming when XAI is used as a basis for decision making (e.g., prescribing drugs based on AI medical predictors) or for legal action (e.g., legal disputes involving AI software). Although weaknesses have been shown to exist across many XAI methods, the reasons behind them remain little explored. Central to this XAI manipulation is the similarity measure used to calculate how one explanation differs from another. A poor choice of similarity measure can lead to erroneous conclusions about the stability or adversarial robustness of an XAI method. This work therefore investigates a variety of similarity measures designed for text-based ranked lists, referenced in related work, to determine their comparative suitability for use. We find that many measures are overly sensitive, resulting in erroneous estimates of stability. We then propose a weighting scheme for text-based data that incorporates the synonymity between the features within an explanation, providing more accurate estimates of the actual weakness of XAI methods to adversarial examples.
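The over-sensitivity of surface measures can be made concrete with a minimal sketch (the feature lists below are invented for illustration): an exact-match measure such as Jaccard similarity over explanation features treats a single synonym substitution as a genuine disagreement, even though the two explanations are semantically equivalent.

```python
def jaccard(a, b):
    """Exact-match Jaccard similarity between two feature lists."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Original explanation and a perturbed one in which a single feature
# was replaced by a synonym ("drug" -> "medicine").
original  = ["drug", "dosage", "patient"]
perturbed = ["medicine", "dosage", "patient"]

print(jaccard(original, perturbed))  # 2 shared of 4 total -> 0.5
```

A surface measure thus reports a 50% change for a meaning-preserving edit, overstating the instability of the explanation; the synonymity-aware weighting scheme proposed in the paper is designed to avoid exactly this kind of overestimate.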
Problem

Research questions and friction points this paper is trying to address.

Local Explanatory Models
Adversarial Attacks
Stability Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weighted Scheme
Textual Explanation Similarity
XAI Robustness Evaluation