From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
By: Fabian Lukassen, Jan Herrmann, Christoph Weisser, and more
Potential Business Impact:
Makes AI explanations easy for anyone to understand.
Explainable AI (XAI) methods like SHAP and LIME produce numerical feature attributions that remain inaccessible to non-expert users. Prior work has shown that Large Language Models (LLMs) can transform these outputs into natural language explanations (NLEs), but it remains unclear which factors contribute to high-quality explanations. We present a systematic factorial study investigating how forecasting model choice, XAI method, LLM selection, and prompting strategy affect NLE quality. Our design spans four models (XGBoost (XGB), Random Forest (RF), Multilayer Perceptron (MLP), and SARIMAX, comparing black-box machine-learning (ML) models against a classical time-series approach), three XAI conditions (SHAP, LIME, and a no-XAI baseline), three LLMs (GPT-4o, Llama-3-8B, and DeepSeek-R1), and eight prompting strategies. Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting. Our results suggest that: (1) XAI provides only small improvements over no-XAI baselines, and only for expert audiences; (2) LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3; (3) we observe an interpretability paradox: in our setting, SARIMAX yielded lower NLE quality than the ML models despite higher prediction accuracy; (4) zero-shot prompting is competitive with self-consistency at seven times lower cost; and (5) chain-of-thought prompting hurts rather than helps.
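To make the pipeline concrete, here is a minimal sketch (not the authors' code) of one cell of the factorial design: SHAP attributions for an XGBoost forecaster are serialized into a zero-shot prompt and verbalized by GPT-4o via the OpenAI chat API. The synthetic data, feature names, and prompt wording are illustrative assumptions; the paper's actual prompts and evaluation harness are not reproduced here.

```python
# Sketch of the SHAP -> LLM natural-language-explanation (NLE) pipeline
# described in the abstract. Assumptions: synthetic lagged features,
# hypothetical prompt text, OpenAI chat API for the LLM call.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb
from openai import OpenAI

# Synthetic stand-in for lagged time-series features and a forecast target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["lag_1", "lag_7", "trend"])
y = 2 * X["lag_1"] - X["lag_7"] + rng.normal(scale=0.1, size=200)

# One of the four forecasting models in the study (XGB).
model = xgb.XGBRegressor(n_estimators=100).fit(X, y)

# One of the three XAI conditions (SHAP): attributions for a single forecast.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[[0]])

# Serialize the numerical attributions into plain text for the prompt.
attributions = "\n".join(
    f"{name}: {value:+.3f}"
    for name, value in zip(X.columns, shap_values[0])
)

# Hypothetical zero-shot prompt (one of the eight prompting strategies).
prompt = (
    "You are explaining a time-series forecast to a non-expert.\n"
    f"Predicted value: {model.predict(X.iloc[[0]])[0]:.2f}\n"
    "SHAP feature attributions (positive values push the forecast up):\n"
    f"{attributions}\n"
    "Write a short natural-language explanation of this forecast."
)

# One of the three LLMs (GPT-4o) generates the NLE.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

The sketch uses plain zero-shot prompting, which the study found competitive with self-consistency at a fraction of the cost; the full design would repeat this cell across all model, XAI, LLM, and prompting combinations before scoring the 660 resulting explanations with dual G-Eval judges.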
Similar Papers
Evaluating the Effectiveness of XAI Techniques for Encoder-Based Language Models
Computation and Language
Helps understand how AI makes decisions.
LLMs for Explainable AI: A Comprehensive Survey
Artificial Intelligence
Makes confusing AI easy for people to understand.
From latent factors to language: a user study on LLM-generated explanations for an inherently interpretable matrix-based recommender system
Artificial Intelligence
Helps people understand why computers suggest things.