Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification

📅 2025-03-06

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Problem: Generating high-fidelity counterfactual explanations for text classification without task-specific fine-tuning remains challenging. Method: We propose a lightweight, LLM-based counterfactual generation framework that integrates the decision signals of a frozen classifier into LLaMA/GPT inference via gradient-aware prompting and output-space constraints, enabling faithful label flipping. Contribution/Results: We introduce two novel fine-tuning-free classifier-guidance mechanisms; show empirically that LLMs rely more on their parametric knowledge than on the classifier's decision logic during counterfactual generation; and demonstrate that counterfactual data augmentation improves classifier robustness. Our method achieves significant gains over SOTA across multiple benchmarks: +12.7% in counterfactual validity, alongside consistent improvements in BLEU and BERTScore, with strong cross-LLM generalization.
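For readers unfamiliar with the validity metric mentioned above, here is a minimal sketch of how it is commonly computed: the share of counterfactuals whose predicted label differs from the prediction on the original text. The function name `flip_rate` and the keyword-based `toy_classify` are hypothetical stand-ins for illustration, not the paper's evaluation code.

```python
from typing import Callable, Sequence


def flip_rate(
    classify: Callable[[str], str],
    originals: Sequence[str],
    counterfactuals: Sequence[str],
) -> float:
    """Fraction of counterfactuals whose predicted label differs from the original's."""
    if len(originals) != len(counterfactuals):
        raise ValueError("originals and counterfactuals must be aligned pairs")
    if not originals:
        return 0.0
    flipped = sum(
        classify(orig) != classify(cf)
        for orig, cf in zip(originals, counterfactuals)
    )
    return flipped / len(originals)


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for a frozen classifier (keyword rule, illustration only)."""
    return "positive" if "good" in text.lower() else "negative"


if __name__ == "__main__":
    originals = ["good film", "bad film"]
    counterfactuals = ["bad film", "dull film"]
    # Only the first pair changes the toy classifier's prediction.
    print(flip_rate(toy_classify, originals, counterfactuals))
```

In practice `classify` would wrap the actual frozen classifier being explained, so the metric measures faithfulness to that model rather than to human labels.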


📝 Abstract

The need for interpretability in deep learning has driven interest in counterfactual explanations, which identify minimal changes to an instance that change a model's prediction. Current counterfactual (CF) generation methods require task-specific fine-tuning and produce low-quality text. Large Language Models (LLMs), though effective for high-quality text generation, struggle with label-flipping counterfactuals (i.e., counterfactuals that change the prediction) without fine-tuning. We introduce two simple classifier-guided approaches to support counterfactual generation by LLMs, eliminating the need for fine-tuning while preserving the strengths of LLMs. Despite their simplicity, our methods outperform state-of-the-art counterfactual generation methods and are effective across different LLMs, highlighting the benefits of guiding counterfactual generation by LLMs with classifier information. We further show that data augmentation by our generated CFs can improve a classifier's robustness. Our analysis reveals a critical issue in counterfactual generation by LLMs: LLMs rely on parametric knowledge rather than faithfully following the classifier.
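The abstract does not spell out the two classifier-guided approaches, so the following is only a generic sketch of what guiding an LLM with classifier information can look like: a generate-and-verify loop in which the frozen classifier checks each candidate and its verdict is fed back into the next prompt. The names `guided_counterfactual`, `propose`, `toy_classify`, and `toy_propose` are hypothetical, and the toy functions stand in for a real classifier and a real LLM call.

```python
from typing import Callable, Optional


def guided_counterfactual(
    text: str,
    target_label: str,
    classify: Callable[[str], str],
    propose: Callable[[str, str, str], str],
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask `propose` for edits until the classifier predicts `target_label`."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(text, target_label, feedback)
        predicted = classify(candidate)
        if predicted == target_label:
            return candidate  # label flipped according to the classifier
        # Tell the generator what the classifier actually predicted.
        feedback = (
            f"Previous attempt {candidate!r} was classified as "
            f"{predicted!r}, not {target_label!r}."
        )
    return None  # no valid counterfactual within the round budget


def toy_classify(text: str) -> str:
    """Hypothetical frozen classifier (keyword rule, illustration only)."""
    return "positive" if "good" in text.lower() else "negative"


def toy_propose(text: str, target_label: str, feedback: str) -> str:
    """Hypothetical LLM stand-in: only edits the text once it receives feedback."""
    if not feedback:
        return text
    return text.replace("bad", "good")


if __name__ == "__main__":
    print(guided_counterfactual("a bad film", "positive", toy_classify, toy_propose))
```

The design point is that acceptance is decided by the classifier being explained, not by the generator's own belief about the label, which is the failure mode the abstract attributes to unguided LLMs.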

Problem

Research questions and friction points this paper is trying to address.

Generating high-quality counterfactual explanations without task-specific fine-tuning.

Improving label-flipping counterfactuals using classifier-guided LLM approaches.

Enhancing classifier robustness through counterfactual data augmentation.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Guides LLM counterfactual generation with classifier information.

Eliminates the need for task-specific fine-tuning.

Improves classifier robustness via counterfactual data augmentation.
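As a rough illustration of counterfactual data augmentation, the sketch below appends each generated counterfactual, paired with its flipped label, to the training set. The summary does not say how the authors label or filter augmented examples, so the filtering rule here is an assumption, and `augment_with_counterfactuals` and `toy_make_counterfactual` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]  # (text, label)


def augment_with_counterfactuals(
    dataset: List[Example],
    make_counterfactual: Callable[[str, str], Optional[Example]],
) -> List[Example]:
    """Return the dataset plus every counterfactual whose label differs from its source."""
    augmented = list(dataset)
    for text, label in dataset:
        result = make_counterfactual(text, label)
        if result is None:
            continue  # generation failed for this example
        cf_text, cf_label = result
        if cf_label != label:  # assumption: keep only label-flipping counterfactuals
            augmented.append((cf_text, cf_label))
    return augmented


def toy_make_counterfactual(text: str, label: str) -> Optional[Example]:
    """Hypothetical generator stand-in: swaps one sentiment word."""
    if "good" in text:
        return text.replace("good", "bad"), "negative"
    if "bad" in text:
        return text.replace("bad", "good"), "positive"
    return None


if __name__ == "__main__":
    train = [("good film", "positive"), ("plain film", "negative")]
    for example in augment_with_counterfactuals(train, toy_make_counterfactual):
        print(example)
```

The classifier would then be retrained on the returned list; whether that improves robustness depends on the quality of the generated counterfactuals.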
