🤖 AI Summary
This study examines the anchoring effect in language models, where irrelevant numbers in a prompt shift numerical judgments, and investigates where the anchor-sensitive signal is carried internally. Using a controlled multiple-choice setup with shared answer options, the authors define a logit-difference metric that compares the correct option with the option corresponding to the anchor, and validate that it tracks behavioral anchoring. They apply attribution-based circuit localization to base and instruction-tuned variants of Qwen and Llama models (7B–8B). Edge-level localization recovers the signal more faithfully than node-level localization. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction, whereas sparse transfer between base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most.
📝 Abstract
Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B–8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
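As a rough illustration of the logit-difference metric described in the abstract, the sketch below computes the gap between the logit of the correct answer option and the logit of the option that matches the anchor. The function name `anchor_logit_diff`, its parameters, and the sign convention (correct minus anchor) are assumptions made for this example; the abstract states only that the two options are compared, so the authors' exact definition may differ.

```python
from typing import Sequence


def anchor_logit_diff(
    option_logits: Sequence[float],
    correct_idx: int,
    anchor_idx: int,
) -> float:
    """Return logit(correct option) - logit(anchor-matching option).

    Hypothetical helper: the sign convention is an assumption, not taken
    from the paper. Under this convention, a smaller (or negative) value
    means the model leans more toward the anchor-matching option.
    """
    if correct_idx == anchor_idx:
        # The comparison is only meaningful when the anchor points at a
        # different option than the correct one.
        raise ValueError("anchor option must differ from the correct option")
    return option_logits[correct_idx] - option_logits[anchor_idx]


# Illustrative logits for four shared answer options (made-up values).
logits = [2.0, 0.5, 1.25, -1.0]
diff = anchor_logit_diff(logits, correct_idx=0, anchor_idx=2)
print(diff)  # 0.75
```

Under this assumed convention, comparing the value for the same question with and without an anchor in the prompt would indicate how far the anchor pulls the decision toward the anchor-matching option.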