🤖 AI Summary
This study addresses the lack of systematic comparison among different types of natural language explanations in terms of model simulatability. Within a unified counterfactual simulation framework, it evaluates how verbalized feature attributions and model-generated rationales affect a predictor's ability to anticipate the behavior of question-answering models. The predictor is a large language model acting as judge, and explanation quality is measured by how well each explanation helps it predict the model's answers to follow-up questions. The evaluation covers multiple attribution methods alongside generated rationales. Results show that the two explanation types differ in how much they improve counterfactual prediction, and that the effect depends on the model, the explanation format, and the feature granularity.
📝 Abstract
Natural-language explanations are often treated as a unified interface for understanding model behavior, but different explanation sources may support simulation in different ways. This paper compares two families of explanations for question answering models: verbalized feature attributions and self-generated rationales. We evaluate them under a shared counterfactual simulation setting, using an LLM judge as predictor and measuring whether it can better predict a model's answers to follow-up questions when given its explanation. Across multiple instruction-tuned models, we analyze how explanation source, verbalization strategy, and feature granularity affect the simulatability of explanations. Our results show that explanation format and granularity affect simulatability: attribution-based explanations and self-generated rationales differ in how much they improve counterfactual prediction, with effects that vary across models and formats.
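The counterfactual simulation measurement described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Example` fields, the `Predictor` interface, the exact-match comparison, and the "gain" definition (accuracy with the explanation minus accuracy without it) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Example:
    """One simulation item (hypothetical layout, not taken from the paper)."""
    question: str      # original question posed to the explained model
    explanation: str   # verbalized attribution or self-generated rationale
    followup: str      # counterfactual follow-up question
    model_answer: str  # the explained model's actual answer to the follow-up


# Assumed predictor interface: (follow-up question, optional explanation) -> predicted answer.
# In the paper's setting this role is played by an LLM judge.
Predictor = Callable[[str, Optional[str]], str]


def simulation_accuracy(
    examples: Sequence[Example], predict: Predictor, use_explanation: bool
) -> float:
    """Fraction of follow-ups where the prediction matches the model's answer."""
    if not examples:
        raise ValueError("need at least one example")
    hits = 0
    for ex in examples:
        explanation = ex.explanation if use_explanation else None
        guess = predict(ex.followup, explanation)
        # Exact match after normalization is an assumption of this sketch.
        hits += guess.strip().lower() == ex.model_answer.strip().lower()
    return hits / len(examples)


def simulatability_gain(examples: Sequence[Example], predict: Predictor) -> float:
    """Accuracy with explanations minus accuracy without them."""
    with_expl = simulation_accuracy(examples, predict, use_explanation=True)
    without_expl = simulation_accuracy(examples, predict, use_explanation=False)
    return with_expl - without_expl
```

Under these assumptions, a positive gain means the explanation helped the predictor anticipate the model's follow-up answers. Computing the gain separately per explanation source, verbalization strategy, and feature granularity would give the kind of comparison the abstract describes.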