🤖 AI Summary
This study investigates whether representations captured by linear probes in the hidden states of large language models stem from genuine differences in reasoning mechanisms or are confounded by task formatting. Focusing on Qwen3-14B across three reasoning tasks, we conduct a systematic analysis integrating linear probing, residualization-based controls, intrinsic dimension estimation, convex hull contamination analysis, trace-anchor similarity, and causal intervention experiments. Our results show that raw probe accuracy reaches 100% but drops to chance level once task format is controlled for. Causal testing further reveals no significant functional association between the geometric structure of hidden states and reasoning mode (p = 0.286). These findings indicate that standard probing analyses are highly susceptible to format confounding and underscore the need to routinely incorporate format-deconfounding controls in mechanistic interpretability research.
📝 Abstract
Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and αNLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination ≤ 1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls (n = 20) shows no functional link between geometry and reasoning mode (p = 0.286). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.
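The residualization control described above can be illustrated with a small sketch. The abstract does not specify the exact procedure, so this assumes ordinary least-squares regression of hidden states on the confounds and a least-squares one-vs-rest probe. All data are synthetic, and the names `residualize` and `probe_accuracy` are hypothetical, not the authors' code.

```python
import numpy as np

def residualize(H, C):
    """Remove the part of H that is linearly predictable from confounds C."""
    C1 = np.column_stack([np.ones(len(C)), C])       # add an intercept column
    beta, *_ = np.linalg.lstsq(C1, H, rcond=None)    # OLS fit, one column of H at a time
    return H - C1 @ beta

def probe_accuracy(H, y, n_folds=5, seed=0):
    """Cross-validated accuracy of a least-squares linear probe (assumed probe type)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)   # one-hot targets
    correct = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        X_train = np.column_stack([np.ones(len(train)), H[train]])
        X_test = np.column_stack([np.ones(len(test)), H[test]])
        W, *_ = np.linalg.lstsq(X_train, Y[train], rcond=None)
        pred = classes[np.argmax(X_test @ W, axis=1)]
        correct += np.sum(pred == y[test])
    return correct / len(y)

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's): 3 sources, 100 items each, 16-dim states.
n_per, d = 100, 16
y = np.repeat(np.arange(3), n_per)                    # source / task label
S = np.eye(3)[y]                                      # one-hot source identity
length = 50 + 30 * y + rng.normal(0, 5, y.size)       # response length, tied to source

# Hidden states carry the label only through format features, plus noise.
A = rng.normal(size=(3, d))
b = rng.normal(size=d)
H = 3.0 * S @ A + 0.02 * length[:, None] * b + rng.normal(size=(y.size, d))

# Confounds: two source dummies (the third is implied by the intercept) and length.
# In this toy setup source identity equals the label, so residualizing removes
# all class-mean differences by construction.
C = np.column_stack([S[:, :2], length])

raw_acc = probe_accuracy(H, y)
resid_acc = probe_accuracy(residualize(H, C), y)
print(f"raw probe accuracy:          {raw_acc:.3f}")
print(f"residualized probe accuracy: {resid_acc:.3f}")
```

In this toy setting the raw probe separates the sources almost perfectly, while the residualized probe should land near the 33.3% chance level. The sketch residualizes on the full dataset for brevity; a more careful version would fit the confound regression inside each training fold.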