Localizing Anchoring Pathways in Language Models

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how irrelevant numbers in a prompt bias numerical reasoning in language models (the anchoring effect) and traces where the anchor-sensitive signal is carried inside the model. Using a controlled multiple-choice setup with shared answer options and a logit-difference metric that compares the correct option with the option corresponding to the anchor, the authors apply attribution-based circuit localization to base and instruction-tuned variants of Qwen and Llama models (7B–8B). They find that edge-level localization recovers the anchoring signal more faithfully than node-level approaches. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction, whereas sparse transfer between base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most.

📝 Abstract

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B–8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
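
The logit-difference metric described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the sign convention (correct minus anchor), the function name `anchoring_logit_diff`, and the option logits below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Mapping


def anchoring_logit_diff(
    option_logits: Mapping[str, float], correct: str, anchor: str
) -> float:
    """Logit of the correct option minus logit of the anchor-matching option.

    Assumed convention: a negative value means the model prefers the
    option corresponding to the anchor over the correct option.
    """
    if correct == anchor:
        raise ValueError("correct and anchor options must differ")
    return option_logits[correct] - option_logits[anchor]


# Hypothetical next-token logits over four shared answer options.
logits = {"A": 2.0, "B": 3.5, "C": 0.5, "D": -1.0}

# Option A is correct; option B is the one corresponding to the anchor.
diff = anchoring_logit_diff(logits, correct="A", anchor="B")
print(diff)  # -1.5
```

Under this assumed convention, comparing the value on an anchored prompt with the value on an anchor-free prompt would indicate how far the anchor shifts the decision.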

Problem

Research questions and friction points this paper is trying to address.

anchoring effect

language models

numerical reasoning

circuit localization

attribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

anchoring effect

circuit localization

attribution methods