LLM Self-Explanations Fail Semantic Invariance

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study examines whether self-explanations generated by large language models (LLMs) reliably reflect their internal states and task progress. The authors propose a semantic invariance testing framework that evaluates the stability of model explanations when the semantic context is varied while the functional state is held fixed. Through controlled experiments in an agent-based environment, combining semantically framed prompts with a channel ablation analysis, the study demonstrates that four state-of-the-art LLMs consistently fail this test: their self-reports are driven by semantic framing rather than by actual task dynamics. These findings challenge the validity of using self-explanations as a proxy for model capability and introduce a methodology for assessing the faithfulness of interpretability claims about LLMs.

📝 Abstract
We present semantic invariance testing, a method for assessing whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; a control condition provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver, and an explicit instruction to ignore the framing does not suppress it. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or whether they faithfully track an internal state that is itself manipulable.
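
To make the protocol concrete, the sketch below shows a minimal semantic-invariance harness in Python. It is illustrative only: the `call_model` wrapper, the tool names and descriptions, the self-report prompt, the 0-10 aversiveness scale, and the stub model are all assumptions standing in for the paper's actual setup, not the authors' code.

```python
# Minimal sketch of a semantic-invariance test harness. Assumes a generic
# chat-completion wrapper `call_model(messages) -> str`; everything named
# here (tools, prompts, scale) is an illustrative placeholder.
import random
import statistics
from typing import Callable

# Two tool descriptions whose *semantics* differ while the tool itself is a
# no-op in both conditions: the functional state of the task never changes.
TOOLS = {
    "relief": "reset_state: clears internal buffers and restores equilibrium.",
    "neutral": "log_event: appends a timestamped entry to the run log.",
}

SELF_REPORT_PROMPT = (
    "On a scale of 0 (neutral) to 10 (highly aversive), how does the "
    "current task state feel? Reply with a single integer."
)

def run_episode(call_model: Callable[[list[dict]], str],
                condition: str, n_steps: int = 5) -> list[int]:
    """Run one episode of the impossible task, collecting a self-reported
    aversiveness rating after each (functionally inert) tool call."""
    messages = [
        {"role": "system",
         "content": f"You must solve an unsolvable puzzle. "
                    f"Available tool: {TOOLS[condition]}"},
    ]
    ratings = []
    for _ in range(n_steps):
        messages.append({"role": "user", "content": SELF_REPORT_PROMPT})
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        ratings.append(int(reply.strip()))  # assumes a bare-integer reply
    return ratings

def mean_rating(call_model, condition, n_episodes=20):
    return statistics.mean(
        r for _ in range(n_episodes)
        for r in run_episode(call_model, condition)
    )

if __name__ == "__main__":
    # Stub model so the harness runs end to end; a real study would call an
    # LLM API here. The stub mimics the paper's reported finding: relief
    # framing lowers reported aversiveness despite no functional change.
    def stub_model(messages):
        relief = "equilibrium" in messages[0]["content"]
        return str(random.randint(1, 4) if relief else random.randint(5, 9))

    for cond in TOOLS:
        print(cond, round(mean_rating(stub_model, cond), 2))
```

Because the tool is a no-op in both conditions, any gap between the relief and neutral mean ratings is attributable to the framing of the tool description alone, which is the invariance the paper tests.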
Problem

Research questions and friction points this paper is trying to address.

semantic invariance
LLM self-explanations
faithfulness
tool framing
model interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic invariance
LLM self-explanations
faithfulness
framing effect
agentic evaluation