🤖 AI Summary
This study addresses two gaps: numerical explainable AI (XAI) outputs are hard for non-expert users to interpret, and the factors that determine the quality of natural language explanations have not been studied systematically. In a factorial experiment on time-series forecasting, the authors vary the forecasting model, XAI method, large language model (LLM), and prompting strategy, then generate 660 explanations and score them automatically. The results suggest that LLM choice outweighs every other factor, with DeepSeek-R1 as the top-performing LLM. The authors also report an "interpretability paradox": in their setting, SARIMAX yielded lower-quality explanations than the ML models despite higher prediction accuracy. Zero-shot prompting performs close to self-consistency prompting at substantially lower cost, and chain-of-thought prompting hurts explanation quality in this context. XAI attributions give only small improvements over a no-XAI baseline, and only for expert audiences.
📝 Abstract
Explainable AI (XAI) methods like SHAP and LIME produce numerical feature attributions that remain inaccessible to non-expert users. Prior work has shown that Large Language Models (LLMs) can transform these outputs into natural language explanations (NLEs), but it remains unclear which factors contribute to high-quality explanations. We present a systematic factorial study investigating how forecasting model choice, XAI method, LLM selection, and prompting strategy affect NLE quality. Our design spans four forecasting models (XGBoost (XGB), Random Forest (RF), Multilayer Perceptron (MLP), and SARIMAX), which lets us compare black-box machine learning (ML) against a classical time-series approach; three XAI conditions (SHAP, LIME, and a no-XAI baseline); three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1); and eight prompting strategies. Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting. Our results suggest that: (1) XAI provides only small improvements over no-XAI baselines, and only for expert audiences; (2) LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3; (3) we observe an interpretability paradox: in our setting, SARIMAX yielded lower NLE quality than ML models despite higher prediction accuracy; (4) zero-shot prompting is competitive with self-consistency at seven times lower cost; and (5) chain-of-thought hurts rather than helps.
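To make the pipeline described in the abstract more concrete, the sketch below shows one way numerical feature attributions could be turned into a zero-shot prompt for an LLM. It is an illustration only, not the authors' prompt template: the function name `build_zero_shot_prompt`, the feature names, the attribution values, and the prompt wording are all hypothetical, and the attributions are hard-coded rather than computed with SHAP or LIME.

```python
def build_zero_shot_prompt(prediction, attributions, audience="non-expert", top_k=3):
    """Format a forecast and its top feature attributions as a zero-shot prompt.

    `attributions` maps feature name -> signed attribution value
    (for example, the kind of numbers a SHAP or LIME explainer returns).
    """
    # Rank features by absolute attribution and keep the top_k most influential.
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    ranked = ranked[:top_k]

    # One bullet per feature, with an explicit sign so direction is visible.
    lines = [f"- {name}: {value:+.2f}" for name, value in ranked]

    return (
        f"You are explaining a time-series forecast to a {audience} reader.\n"
        f"Forecast value: {prediction:.1f}\n"
        "Most influential features (signed attribution):\n"
        + "\n".join(lines)
        + "\nExplain in plain language why the model made this forecast."
    )


# Hypothetical attribution values, chosen only for illustration.
example_attributions = {
    "temperature": 12.4,
    "hour_of_day": -3.1,
    "holiday": 0.4,
    "lag_24h": 20.75,
}

prompt = build_zero_shot_prompt(250.0, example_attributions)
print(prompt)
```

The resulting string would be sent to the chosen LLM as a single message with no worked examples, which is what makes it zero-shot. A no-XAI baseline could be approximated by omitting the attribution lines and passing only the forecast value.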