🤖 AI Summary
The mechanisms by which post-training improves reasoning in large language models (LLMs) remain poorly understood: supervised fine-tuning (SFT), knowledge distillation (KD), and reinforcement learning methods (e.g., GRPO) demonstrably enhance complex reasoning, yet how they do so at the level of model internals is unclear.
Method: The authors systematically analyze the dynamic evolution of attention heads across these training paradigms, integrating circuit analysis, ablation studies, and qualitative validation.
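To make the ablation component of such an analysis concrete, below is a minimal sketch of a single-head ablation, assuming a Qwen2-style checkpoint loaded via Hugging Face `transformers`. The model name, layer index, and head index are illustrative assumptions, not the paper's actual setup: one head's slice of the concatenated attention outputs is zeroed just before the output projection, and the shift in next-token log-probabilities on a small arithmetic probe serves as a crude proxy for the head's causal importance.

```python
# Minimal single-head ablation sketch. Assumes a Qwen2-style layout in
# `transformers`, where o_proj receives the concatenated per-head outputs;
# adapt the module paths for other architectures.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any Qwen2-style checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

def ablate_head(layer_idx: int, head_idx: int):
    """Zero one head's contribution by masking its slice of the
    concatenated head outputs right before the output projection."""
    attn = model.model.layers[layer_idx].self_attn
    # Standard layout: hidden_size == num_attention_heads * head_dim.
    head_dim = model.config.hidden_size // model.config.num_attention_heads
    lo, hi = head_idx * head_dim, (head_idx + 1) * head_dim

    def pre_hook(module, args):
        hidden = args[0].clone()
        hidden[..., lo:hi] = 0.0  # knock out this head only
        return (hidden,)

    return attn.o_proj.register_forward_pre_hook(pre_hook)

# Compare next-token log-probs on a small arithmetic probe with and
# without the head.
prompt = "137 + 256 = "
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    base = model(**inputs).logits[0, -1].log_softmax(-1)
handle = ablate_head(layer_idx=10, head_idx=3)  # hypothetical head
with torch.no_grad():
    ablated = model(**inputs).logits[0, -1].log_softmax(-1)
handle.remove()
print("KL(base || ablated):", torch.nn.functional.kl_div(
    ablated, base, log_target=True, reduction="sum").item())
```

Sweeping `layer_idx` and `head_idx` over all heads and averaging the effect over a probe set yields the kind of per-head importance map that circuit analyses of this sort are typically built from.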
Contribution/Results: They identify, for the first time, that training for complex reasoning dynamically induces functionally specialized “computational” attention heads that form structured, cooperative circuits supporting stepwise inference. Contrary to prevailing hypotheses, no dedicated “thinking switch” head exists; instead, reasoning enhancement frequently incurs a trade-off, degrading basic computational capabilities and revealing an intrinsic tension. The study maps distinct attention-head differentiation pathways under different training strategies, uncovers the circuit-level basis of overthinking, and provides a theoretical foundation for interpretable reasoning models and more robust training design.
📝 Abstract
The remarkable capabilities of modern large reasoning models are largely unlocked through post-training techniques such as supervised fine-tuning and reinforcement learning. However, the architectural mechanisms behind such improvements remain largely opaque. In this work, we use circuit analysis to demonstrate that post-training for complex reasoning sparks the emergence of novel, functionally specialized attention heads. These heads collectively support structured reasoning and computation. Our comparative analysis across Qwen model families and a DeepSeek-distilled model reveals that these emergent heads evolve differently under different training regimes. Distillation and SFT foster a cumulative addition of stable reasoning heads. In contrast, group relative policy optimization operates in a dynamic search mode: relatively few attention heads are iteratively activated, evaluated, and pruned, with their survival closely tracking fluctuations in the task reward signal. Furthermore, we find that controllable think-on/off models do not possess dedicated thinking heads. Instead, turning off explicit reasoning triggers a broader, but less efficient, set of compensatory heads. Through ablation and qualitative analyses, we connect these circuit-level dynamics to a crucial performance trade-off: strengthened heads enable sophisticated problem-solving strategies for difficult problems but can also introduce overthinking failure modes, such as calculation errors or logical loops on simpler tasks. These findings link circuit-level dynamics to macro-level performance, identifying an inherent tension whereby complex reasoning comes at the cost of elementary computation. More broadly, our work points to future directions for training policy design, emphasizing the need to balance the development of effective reasoning strategies with the assurance of reliable, flawless execution.
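As a sketch of how the checkpoint-level claim about GRPO could be operationalized, the following self-contained example scores every head at each saved checkpoint, flags heads that emerge relative to the base model, and correlates the emergent-head count with task reward. The scores here are synthetic placeholders (in a real analysis they would come from ablation effects like the sketch above), and the threshold and dimensions are assumptions, not values from the paper.

```python
# Sketch: detect "emergent" heads across checkpoints and test whether
# their count tracks the reward signal. All numbers below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_ckpts, n_layers, n_heads = 8, 24, 16

# Stand-in for measured per-head importance at each checkpoint
# (e.g., mean ablation effect over a probe set).
importance = rng.random((n_ckpts, n_layers, n_heads)) * 0.1
reward = np.linspace(0.3, 0.7, n_ckpts) + rng.normal(0, 0.03, n_ckpts)

base = importance[0]                     # pre-training baseline scores
THRESH = 0.05                            # emergence margin (assumed)
emergent = (importance - base) > THRESH  # head newly important at ckpt t
counts = emergent.reshape(n_ckpts, -1).sum(axis=1)

# Pearson correlation between emergent-head count and reward across
# checkpoints: a crude test of the "survival tracks reward" claim.
r = np.corrcoef(counts, reward)[0, 1]
print("emergent heads per checkpoint:", counts)
print("corr(emergent-head count, reward) = %.3f" % r)
```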