Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks

📅 2025-02-07
📈 Citations: 1
Influential: 1
🤖 AI Summary
This study investigates the capability of large language models (LLMs) to generate trustworthy free-text explanations under distribution shift. It introduces a large-scale, cross-task benchmark for explanation generation covering 19 out-of-distribution (OOD) datasets spanning natural language inference, fact-checking, and hallucination detection in abstractive summarization. The method fine-tunes T5-Large and OLMo-7B models, combines them with few-shot selection strategies, and scores the generated explanations with reference-free metrics, including the Acceptability score (T5-11B), which assesses explanation faithfulness, coherence, and informativeness. Key findings: (i) a small number of high-quality annotated examples substantially improves OOD explanation quality; (ii) explanation quality correlates strongly with label prediction accuracy; (iii) the Acceptability score achieves a Pearson correlation of 0.82 with human judgments, the strongest among the tested metrics; and (iv) the fine-tuning data source influences OOD performance more than the sample selection strategy.

📝 Abstract
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data, making it challenging to train models for explainable predictions. To address this, we investigate how to use existing explanation datasets for self-rationalization and evaluate models' out-of-distribution (OOD) performance. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization. To evaluate the generated explanations, we conduct a human study on 13 selected models and measure how human judgments correlate with the Acceptability score (T5-11B) and three other LLM-based reference-free metrics. Human evaluation shows that the Acceptability score correlates most strongly with human judgments, demonstrating its effectiveness in evaluating free-text explanations. Our findings reveal: 1) few annotated examples effectively adapt models for OOD explanation generation; 2) compared to sample selection strategies, fine-tuning data source has a larger impact on OOD performance; and 3) models with higher label prediction accuracy tend to produce better explanations, as reflected by higher Acceptability scores.
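The metric evaluation described above hinges on correlating automatic explanation scores with human ratings. A minimal sketch of that comparison using Pearson's r; the `metric_scores` and `human_ratings` values below are illustrative placeholders, not data from the paper:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five explanations: an automatic reference-free
# metric (0-1 scale) versus human ratings (1-5 Likert scale).
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
human_ratings = [5, 2, 4, 1, 4]
r = pearson(metric_scores, human_ratings)
```

A high r indicates the automatic metric ranks explanations similarly to humans, which is how the paper argues for the Acceptability score over the other reference-free metrics it tests.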
Problem

Research questions and friction points this paper is trying to address.

How well do self-rationalization models generalize to out-of-distribution data?
How do fine-tuning data quality, sample count, and few-shot selection affect explanation generation?
How well do reference-free metrics such as the Acceptability score correlate with human judgments?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes T5-Large and OLMo-7B
Evaluates 19 diverse OOD datasets
Human study on 13 models
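The few-shot selection methods compared in the paper are not detailed on this page; a common baseline is to pick the annotated examples whose embeddings are most similar to the query. A minimal sketch under that assumption, where `select_few_shot`, the toy 2-d vectors, and the example ids are all hypothetical:

```python
def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def select_few_shot(query_vec, pool, k=2):
    # Rank the annotated pool by similarity to the query and keep the top-k
    # examples to place in the prompt.
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

# Toy pool of annotated examples with precomputed (hypothetical) embeddings.
pool = [
    {"id": "a", "vec": [1.0, 0.0]},
    {"id": "b", "vec": [0.0, 1.0]},
    {"id": "c", "vec": [0.9, 0.1]},
]
selected = select_few_shot([1.0, 0.1], pool, k=2)
```

One takeaway from the paper's findings is that such selection strategies matter less for OOD explanation quality than the source of the fine-tuning data itself.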