LLM Explainability with Counterfactual Chains and Causal Graphs

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of concept-level transparency in how large language models (LLMs) reason internally. It introduces a four-stage framework that models high-level LLM reasoning with causal graphs. The approach first extracts interpretable, class-discriminative concepts and maps each input to the concept states the LLM perceives. It then augments the sparse observational data with MCMC-inspired counterfactual chains and applies the σ-CG algorithm for stable causal discovery. Evaluated on disease diagnosis, sentiment analysis, and LLM-as-a-judge tasks, the resulting causal graphs are assessed for predictive fidelity and structural stability, and they capture semantic dependencies consistent with the LLM's reasoning behavior, laying a foundation for concept-level interpretability.
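To make the idea of a counterfactual chain concrete, here is a minimal sketch that assumes binary concept states and a random walk that flips one concept per step. The summary does not specify the actual procedure, so the function names, the single-flip proposal, and the `toy_predict` stand-in for the target LLM are all illustrative assumptions, not the authors' method.

```python
import random
from typing import Callable, Dict, List, Tuple

# Assumption: a concept state maps each concept name to 0 (absent) or 1 (present).
ConceptState = Dict[str, int]


def counterfactual_chain(
    start: ConceptState,
    predict: Callable[[ConceptState], str],
    steps: int,
    rng: random.Random,
) -> List[Tuple[ConceptState, str]]:
    """Walk over concept states, flipping one randomly chosen concept per step.

    Each visited state is recorded together with the prediction it receives,
    so a single observed example grows into a chain of counterfactual samples.
    """
    state = dict(start)
    concepts = sorted(state)
    samples = [(dict(state), predict(state))]
    for _ in range(steps):
        concept = rng.choice(concepts)
        state[concept] = 1 - state[concept]  # counterfactual edit: toggle one concept
        samples.append((dict(state), predict(state)))
    return samples


def toy_predict(state: ConceptState) -> str:
    """Hypothetical stand-in for querying the target LLM on an edited input."""
    return "flu" if state["fever"] and state["cough"] else "other"


chain = counterfactual_chain(
    {"fever": 1, "cough": 1, "rash": 0},
    toy_predict,
    steps=5,
    rng=random.Random(0),
)
for concept_state, label in chain:
    print(concept_state, "->", label)
```

In a real pipeline the prediction step would presumably rewrite the input text and query the LLM again; the toy rule above only keeps the sketch self-contained.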

📝 Abstract

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with $σ$-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with LLMs' reasoning. Together, this paper provides a foundation for concept-level explainability of LLMs.
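The abstract names predictive fidelity and structural stability as evaluation criteria without defining them here. The sketch below shows one plausible reading, assuming fidelity is the agreement rate between graph-based and LLM predictions, and stability is the mean pairwise Jaccard similarity of edge sets across repeated discovery runs. Both definitions and all function names are assumptions for illustration; the paper's actual metrics may differ.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple

# Assumption: a directed edge is a (cause, effect) pair of concept names.
Edge = Tuple[str, str]


def predictive_fidelity(graph_preds: Sequence[str], llm_preds: Sequence[str]) -> float:
    """Fraction of examples where the graph-based prediction matches the LLM's."""
    if len(graph_preds) != len(llm_preds) or not llm_preds:
        raise ValueError("prediction lists must be non-empty and of equal length")
    matches = sum(g == l for g, l in zip(graph_preds, llm_preds))
    return matches / len(llm_preds)


def jaccard(a: Set[Edge], b: Set[Edge]) -> float:
    """Jaccard similarity of two edge sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def structural_stability(edge_sets: List[Set[Edge]]) -> float:
    """Mean pairwise Jaccard similarity over the edge sets of repeated runs."""
    pairs = list(combinations(edge_sets, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical edge sets from three discovery runs on resampled data.
runs = [
    {("fever", "diagnosis"), ("cough", "diagnosis")},
    {("fever", "diagnosis")},
    {("fever", "diagnosis"), ("cough", "diagnosis")},
]
fidelity = predictive_fidelity(["flu", "other", "flu", "flu"], ["flu", "other", "other", "flu"])
stability = structural_stability(runs)
print(f"fidelity={fidelity:.2f} stability={stability:.2f}")
```

Jaccard similarity is used here only because it is a simple, common way to compare edge sets; other structural distances would fit the same slot.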

Problem

Research questions and friction points this paper is trying to address.

LLM explainability

causal graphs

concept-level interpretation

counterfactual reasoning

model transparency

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal graphs

counterfactual chains

concept-level explainability