🤖 AI Summary
This work addresses text detoxification—rephrasing text to remove offensive or harmful content while preserving its original non-toxic meaning. We propose a counterfactual generation–based detoxification method, the first to adapt counterfactual explanation techniques from explainable AI (XAI) to this task. Guided by local feature importance derived from a toxicity classifier, our approach generates rewritten texts with high semantic fidelity and substantially reduced toxicity. Experiments on three datasets, evaluated with both automatic metrics and human judgments, show that our method outperforms three competing detoxification approaches in both toxicity reduction and meaning preservation. We also discuss the polysemous nature of toxicity and the risk of malicious use of automated detoxification tools.
📝 Abstract
Toxicity mitigation consists of rephrasing text to remove offensive or harmful meaning. Neural natural language processing (NLP) models have been widely used to detect and mitigate textual toxicity. However, existing methods struggle to detoxify text while preserving its initial non-toxic meaning. In this work, we propose to apply counterfactual generation methods from the eXplainable AI (XAI) field to detect and mitigate textual toxicity. Specifically, we perform text detoxification by applying local feature importance and counterfactual generation methods to a toxicity classifier that distinguishes toxic from non-toxic texts. We carry out text detoxification through counterfactual generation on three datasets and compare our approach to three competitors. Automatic and human evaluations show that recently developed NLP counterfactual generators can mitigate toxicity accurately while better preserving the meaning of the initial text than classical detoxification methods. Finally, we take a step back from automated detoxification tools and discuss how to manage the polysemous nature of toxicity and the risk of malicious use of detoxification tools. This work is the first to bridge the gap between counterfactual generation and text detoxification, and it paves the way towards more practical applications of XAI methods.
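The pipeline the abstract describes — score a text with a toxicity classifier, rank tokens by local feature importance, then edit until the classifier's decision flips — can be sketched at toy scale. Everything below is illustrative, not the paper's implementation: the "classifier" is a keyword lexicon, importance is computed by occlusion, and the counterfactual edit is token deletion.

```python
# Toy sketch of classifier-guided counterfactual detoxification.
# Hypothetical lexicon standing in for a learned toxicity classifier.
TOXIC_WORDS = {"stupid", "idiot", "trash"}

def toxicity_score(tokens):
    """Toy 'classifier': fraction of tokens flagged as toxic."""
    if not tokens:
        return 0.0
    return sum(t.lower() in TOXIC_WORDS for t in tokens) / len(tokens)

def feature_importance(tokens):
    """Occlusion-based local importance: the score drop caused by
    removing each token in turn."""
    base = toxicity_score(tokens)
    return [base - toxicity_score(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

def detoxify(text, threshold=0.0):
    """Greedily remove the token that most drives the toxicity score
    until the score falls to the threshold (a counterfactual edit)."""
    tokens = text.split()
    while tokens and toxicity_score(tokens) > threshold:
        imp = feature_importance(tokens)
        worst = max(range(len(tokens)), key=imp.__getitem__)
        if imp[worst] <= 0:
            break  # no single deletion lowers the score further
        tokens.pop(worst)
    return " ".join(tokens)
```

For example, `detoxify("this idea is stupid and you are an idiot")` drops "stupid" and "idiot" and keeps the rest. Deletion alone preserves meaning poorly; the counterfactual generators evaluated in the paper instead produce fluent substitutions, which is what yields the reported semantic-preservation gains.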