🤖 AI Summary
Existing hidden-state-recurrence latent reasoning methods are difficult to optimize with on-policy reinforcement learning and hard to interpret causally. This work proposes SWITCH, a framework that uses the explicit boundary tokens <swi> and </swi> to let the model switch into and out of latent reasoning. It combines a visible-to-latent curriculum, a Switch-GRPO objective, and causal probing anchored at the boundary tokens, so that the same discrete boundaries address both the training and the interpretability problem. The authors report that SWITCH outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Their mechanistic analysis indicates that <swi> is a sharply localised, learned switching policy, and that the latent step it opens performs causally important computation concentrated at a single hidden-state transition on entry.
📝 Abstract
Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
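To make the claim about the policy ratio concrete, here is a minimal sketch of a GRPO-style clipped surrogate in which only discrete positions, including <swi> and </swi>, contribute a ratio term. It is an illustration under stated assumptions, not the paper's Switch-GRPO objective: the function names, the use of `None` to mark latent steps, and the clip value of 0.2 are hypothetical, and the abstract does not specify how latent steps enter the loss.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardise rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(new_logps, old_logps, advantage, clip=0.2):
    """Mean clipped objective over the discrete positions of one rollout.

    Assumption (not from the paper): a latent step is marked with None because
    no token was sampled there, so it has no log-probability and no ratio.
    Boundary tokens such as <swi> and </swi> are ordinary sampled tokens and
    therefore carry a log-probability like any other position.
    """
    terms = []
    for new_lp, old_lp in zip(new_logps, old_logps):
        if new_lp is None or old_lp is None:
            continue  # latent step: skipped in this sketch
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        terms.append(min(ratio * advantage, clipped * advantage))
    return sum(terms) / len(terms) if terms else 0.0


if __name__ == "__main__":
    # Two rollouts for one prompt with toy rewards.
    advantages = group_advantages([1.0, 0.0])
    # Toy rollout: <swi>, one latent step, </swi>, one answer token.
    old = [-0.7, None, -0.3, -1.2]
    new = [-0.5, None, -0.3, -1.0]
    print(clipped_surrogate(new, old, advantages[0]))
```

In this toy setup the latent step is skipped entirely, so the ratio is only ever evaluated where a token was actually sampled. How gradients reach the recurrent latent computation, which the abstract attributes to Switch-GRPO, is outside the scope of this sketch.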