🤖 AI Summary
Existing interpretability methods often erroneously attribute specific computational roles to attention heads without verifying their generalization across diverse prompts. This work proposes the KID role classification framework and a three-stage analysis pipeline that combines activation patching with same-answer control conditions to expose widespread pseudo-semantic specificity in conventional attribution approaches. By integrating capability-selective screening (CSS), singular value decomposition (SVD), and activation transduction under matched controls, we systematically evaluate attention head functionality across multiple 7–8B instruction-tuned models. Our findings demonstrate that the majority of attention heads previously identified by standard methods as performing specific roles fail to consistently transfer their purported computational functions across different prompts, thereby challenging the dominant attribution paradigm in mechanistic interpretability.
📝 Abstract
In mechanistic interpretability, attention heads are commonly elevated to role claims (e.g., "this head represents addition") when they are necessary for a behavior, encode it linearly, and recover that behavior when restored after ablation. We show this evidence is insufficient: across three 7-8B instruction-tuned models and five computation families, heads passing all three checks routinely fail to transfer the computation when their activations are patched into a different prompt under matched controls. We introduce KID (Knowing / Intent / Doing), a role-assignment lens for attention heads, and pair it with a three-stage pipeline: capability-selective screening (CSS), singular value decomposition (SVD), and activation transduction under matched controls. Our results document a preliminary role taxonomy (including prompt-trajectory stabilizers, answer-side logit-bias heads, and soft computation-pattern carriers) and show that the same-answer control (a transduction target sharing the answer string but not the requested computation) is an underused check that exposes broad state transfer masquerading as semantic specificity.