Self-Critique and Refinement for Faithful Natural Language Explanations

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often generate post-hoc natural language explanations (NLEs) that are unfaithful to their actual reasoning processes. To address this, we propose SR-NLE, a novel unsupervised, iterative self-critique framework for improving NLE faithfulness. SR-NLE combines natural language self-feedback, fine-grained word-level feature-attribution feedback (via gradients and attention), and multi-round prompt-guided LLM self-refinement, requiring neither external annotations nor model fine-tuning. Experiments across three benchmark datasets and four state-of-the-art LLMs show that the average unfaithfulness rate drops from 54.81% to 36.02% (an absolute reduction of 18.79%), substantially improving explanation credibility. Our key contributions are: (i) the first application of self-critique to improving explanation faithfulness; and (ii) a training-free, interpretable word-level attribution feedback mechanism grounded in gradient- and attention-based feature importance.

📝 Abstract
With the rapid development of large language models (LLMs), natural language explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model's actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations -- specifically, post-hoc NLEs -- through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for the baseline -- an absolute reduction of 18.79%. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
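The critique-and-refine loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the generic `llm(prompt) -> str` interface, and the prompt wording are all hypothetical assumptions.

```python
def sr_nle_refine(llm, question, answer, max_rounds=3):
    """Iteratively critique and refine a post-hoc explanation.

    `llm` is a hypothetical callable mapping a prompt string to a
    completion string; all prompt text here is illustrative only.
    """
    # Step 1: generate an initial post-hoc explanation for the prediction.
    explanation = llm(
        f"Question: {question}\nAnswer: {answer}\n"
        "Explain why this answer follows from the question."
    )
    for _ in range(max_rounds):
        # Step 2: self-critique -- ask the model whether the explanation
        # faithfully reflects the reasoning behind its answer.
        feedback = llm(
            "Critique the explanation below: does it faithfully reflect "
            f"the reasoning for the answer?\nExplanation: {explanation}"
        )
        # Stop refining once the critique raises no issues.
        if "no issues" in feedback.lower():
            break
        # Step 3: refine the explanation using the feedback.
        explanation = llm(
            "Revise the explanation using this feedback.\n"
            f"Feedback: {feedback}\nExplanation: {explanation}"
        )
    return explanation
```

Per the abstract, the feedback step need not be free-form text: it can instead surface word-level feature attributions (gradient- or attention-based importance over input words), prompting the model to revise its explanation to mention the words that actually drove the prediction.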
Problem

Research questions and friction points this paper is trying to address.

Improving faithfulness of natural language explanations from LLMs
Enhancing model self-critique for explanation refinement
Reducing unfaithfulness rates without external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-critique and refinement framework
Natural language and feature attribution feedback
No external supervision or training needed