Internal states before wait modulate reasoning patterns

📅 2025-10-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the neural mechanisms underlying “wait” behaviors—such as backtracking and self-correction—in large language model (LLM) inference, focusing on how hidden states modulate the generation of wait tokens and thereby steer subsequent reasoning paths. Method: We propose a cross-encoder hidden-state attribution method to precisely identify features governing wait-token probability, complemented by causal intervention experiments to assess interpretability and generalizability across models, including DeepSeek-R1-Distill-Llama-8B and its base model. Contribution/Results: We are the first to discover that a small set of high-impact features can selectively trigger distinct reasoning modes—including reasoning restart, knowledge retrieval, and uncertainty expression—demonstrating that LLM inference strategies are decomposable and causally controllable at the neural level. This reveals a mechanistic, intervention-ready foundation for modeling controllable inference, establishing a novel paradigm for interpretable and steerable reasoning in LLMs.
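The causal-intervention idea described above can be sketched as a toy computation: steer a hidden state along a feature's decoder direction and watch how the probability of the wait token moves. This is a minimal numpy sketch under stated assumptions — the dimensions, the `WAIT_ID` index, and the feature direction (here taken to align with the wait-token unembedding column) are all illustrative, not the paper's actual code or weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000
WAIT_ID = 7  # hypothetical vocabulary index of the "wait" token

W_U = rng.normal(size=(d_model, vocab)) * 0.1  # toy unembedding matrix
h = rng.normal(size=d_model)                   # hidden state just before a potential "wait"

# Toy feature decoder direction, assumed aligned with the wait-token unembedding column
d_wait = W_U[:, WAIT_ID] / np.linalg.norm(W_U[:, WAIT_ID])

def wait_prob(hidden):
    """Softmax probability of the wait token given a hidden state."""
    logits = hidden @ W_U
    p = np.exp(logits - logits.max())
    return (p / p.sum())[WAIT_ID]

base = wait_prob(h)                        # no intervention
steered = wait_prob(h + 5.0 * d_wait)      # add the feature direction (promote "wait")
suppressed = wait_prob(h - 5.0 * d_wait)   # subtract it (suppress "wait")
print(base, steered, suppressed)
```

Adding the direction raises the wait logit relative to the rest of the vocabulary, so the promoted probability exceeds the baseline and the suppressed one falls below it — the same promote/suppress contrast the intervention experiments rely on.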

📝 Abstract
Prior work has shown that a significant driver of performance in reasoning models is their ability to reason and self-correct. A distinctive marker in these reasoning traces is the token wait, which often signals reasoning behavior such as backtracking. Despite being such a complex behavior, little is understood about exactly why models do or do not decide to reason in this particular manner, which limits our understanding of what makes a reasoning model so effective. In this work, we address the question of whether a model's latents preceding wait tokens contain relevant information for modulating the subsequent reasoning process. We train crosscoders at multiple layers of DeepSeek-R1-Distill-Llama-8B and its base version, and introduce a latent attribution technique in the crosscoder setting. We locate a small set of features relevant for promoting or suppressing the probability of wait tokens. Finally, through a targeted series of experiments analyzing max activating examples and causal interventions, we show that many of our identified features are indeed relevant for the reasoning process and give rise to different types of reasoning patterns, such as restarting from the beginning, recalling prior knowledge, expressing uncertainty, and double-checking.
Problem

Research questions and friction points this paper is trying to address.

Understanding how latent states influence wait-token reasoning
Identifying features that modulate reasoning patterns in models
Analyzing causal interventions on reasoning behaviors like backtracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training crosscoders at multiple model layers
Introducing latent attribution in crosscoder setting
Identifying features controlling wait token probabilities
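The latent-attribution step listed above can be illustrated with a toy computation. Here attribution is taken as a feature's activation times the alignment of its decoder direction with the wait-token unembedding — a common direct-effect approximation, assumed for illustration and not necessarily the paper's exact formula; all shapes and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, d_model = 512, 64

acts = np.abs(rng.normal(size=n_features))      # toy crosscoder feature activations
W_dec = rng.normal(size=(n_features, d_model))  # toy decoder directions, one per feature
u_wait = rng.normal(size=d_model)               # toy unembedding vector for "wait"

# Direct-effect attribution of each feature to the wait-token logit:
# activation * (decoder direction . wait unembedding)
attr = acts * (W_dec @ u_wait)

# Locate the small set of high-impact features (top-k by absolute attribution)
top = np.argsort(-np.abs(attr))[:10]
print(top)
```

Ranking by absolute attribution surfaces both wait-promoting (positive) and wait-suppressing (negative) features, matching the promote/suppress framing in the abstract.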