When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether the reasoning steps generated by state-of-the-art large language models genuinely drive their final answers or merely serve as post-hoc justifications. The authors propose a step-level ablation method that requires only API access (no model weights): individual reasoning sentences are systematically removed and the resulting changes in final answers are measured. Applied across diverse tasks, including sentiment analysis, mathematical reasoning, topic classification, and medical diagnosis, experiments on ten prominent models show that for most of them, removing any single reasoning step changes the answer in fewer than 17% of cases, suggesting largely decorative reasoning. Only MiniMax-M2.5 and Kimi-K2.5 exhibit higher reasoning dependence, and only on specific tasks. This study presents the first fine-grained, quantitative analysis of reasoning faithfulness under black-box conditions.

📝 Abstract
Language models increasingly "show their work" by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes "The patient's eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B." If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no: the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access (no model weights) and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), we find that the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone suffices to recover the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%), but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover "output rigidity": on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis of attention patterns confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives, not scale, determine whether reasoning is genuine.
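The ablation test described in the abstract can be sketched in a few lines of Python. Everything here is an assumption for illustration: `ask_model` is a hypothetical stand-in for the authors' API call (re-querying with the ablated chain), and the naive sentence splitter is a placeholder; the paper's exact prompts and segmentation are not reproduced here.

```python
import re

def split_steps(reasoning: str) -> list[str]:
    """Naive sentence-level segmentation of a reasoning chain.

    The paper may use a more careful segmenter; this splits on
    whitespace following sentence-ending punctuation.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", reasoning) if s.strip()]

def step_necessity(question, reasoning, answer, ask_model) -> float:
    """Fraction of reasoning steps whose removal flips the final answer.

    A value near 0 indicates decorative reasoning (no step is necessary);
    higher values indicate genuine step dependence.
    """
    steps = split_steps(reasoning)
    if not steps:
        return 0.0
    flips = 0
    for i in range(len(steps)):
        # Remove exactly one step, keep the rest of the chain intact.
        ablated = " ".join(steps[:i] + steps[i + 1:])
        # Re-query the model with the ablated chain and compare answers.
        if ask_model(question, ablated) != answer:
            flips += 1
    return flips / len(steps)
```

Averaging `step_necessity` over a task's questions yields per-model, per-task necessity scores comparable to the percentages reported above (e.g. 37% for MiniMax-M2.5 on sentiment); only black-box API access is needed.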
Problem

Research questions and friction points this paper is trying to address.

chain-of-thought
reasoning faithfulness
step-level evaluation
decorative reasoning
language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-level evaluation
chain-of-thought faithfulness
decorative reasoning
output rigidity
reasoning necessity
Abhinaba Basu
Indian Institute of Information Technology Allahabad (IIITA); National Institute of Electronics and Information Technology (NIELIT)
Pavan Chakraborty
Indian Institute of Information Technology Allahabad
Artificial Intelligence · Robotics & Instrumentation