TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

📅 2026-06-04

🤖 AI Summary
Existing continuous latent reasoning methods are deterministic, which limits policy exploration in reinforcement learning (RL). This work proposes TARPO, a framework that switches at each token between explicit reasoning (discrete token sampling) and latent reasoning (continuous representations). A lightweight binary routing head selects the mode at each step, and the LLM backbone and the router are optimized jointly, end to end, with a shared group-relative advantage signal. Because the routing decision is itself sampled, the policy keeps the stochasticity of discrete token sampling while gaining the expressiveness of continuous representations. Experiments on Qwen2.5 (1.5B–7B) and Llama-3.1-8B show that TARPO consistently outperforms existing explicit and latent RL reasoning baselines across multiple benchmarks while maintaining stable training dynamics.
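To make the "shared group-relative advantage signal" concrete, here is a minimal sketch. This page does not give the formula, so the normalization below (reward minus group mean, divided by group standard deviation, as commonly done in GRPO-style training) is an assumption, and the function names `group_relative_advantages` and `share_advantage` are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its sampling group.

    ASSUMPTION: mean/std normalization in the GRPO style; the exact
    formula used by TARPO is not stated on this page.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def share_advantage(rewards, steps_per_rollout):
    """Give every token action and every routing action in a rollout
    the same scalar advantage (the 'shared' signal)."""
    advantages = group_relative_advantages(rewards)
    return [
        {"token": [a] * n, "router": [a] * n}
        for a, n in zip(advantages, steps_per_rollout)
    ]


if __name__ == "__main__":
    # Four rollouts for one prompt, with binary correctness rewards.
    shared = share_advantage([1.0, 0.0, 1.0, 0.0], [3, 2, 4, 1])
    print(shared[0]["router"])
```

Rollouts that beat the group average receive a positive advantage for both their token choices and their routing choices, so a single reward drives the backbone and the router together.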
📝 Abstract
Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at https://github.com/NKU-LITI/TARPO-master.
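The token-wise routing described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the router is modeled as a single linear layer with a sigmoid, and the latent step is modeled as a probability-weighted mix of token embeddings, neither of which is specified in the abstract. All names (`route_step`, `soft_embedding`, `reasoning_step`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_step(hidden, weights, bias, rng):
    """Sample a binary mode from the router; return (use_latent, log_prob).

    ASSUMPTION: the router is one linear layer over the hidden state.
    """
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    p_latent = sigmoid(logit)
    use_latent = rng.random() < p_latent
    log_prob = math.log(p_latent if use_latent else 1.0 - p_latent)
    return use_latent, log_prob


def soft_embedding(probs, embeddings):
    """ASSUMPTION: a latent step feeds back the expected token embedding."""
    dim = len(embeddings[0])
    return [sum(p * row[d] for p, row in zip(probs, embeddings)) for d in range(dim)]


def reasoning_step(hidden, probs, embeddings, weights, bias, rng):
    """One decoding step: route first, then act in the chosen mode."""
    use_latent, logp_route = route_step(hidden, weights, bias, rng)
    if use_latent:
        # Continuous step: the only sampled action is the routing decision.
        return {"mode": "latent", "next_input": soft_embedding(probs, embeddings),
                "logp": logp_route}
    # Explicit step: additionally sample a discrete token from the vocabulary.
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return {"mode": "explicit", "token": token, "next_input": embeddings[token],
            "logp": logp_route + math.log(probs[token])}


if __name__ == "__main__":
    rng = random.Random(0)
    emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token vocabulary
    step = reasoning_step([0.5, -0.2], [0.7, 0.2, 0.1], emb, [0.1, 0.3], 0.0, rng)
    print(step["mode"], step["logp"])
```

In this sketch every step contributes a log-probability even when the step is latent, because the routing decision is sampled. That is what keeps the trajectory stochastic and lets a policy-gradient update reach the router as well as the backbone.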
Problem

Research questions and friction points this paper is trying to address.

latent reasoning
Chain-of-Thought
reinforcement learning
policy exploration
token generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent-Explicit Reasoning
Action-Routing Policy Optimization
Token-Wise Switching
Reinforcement Learning for LLMs
Stochastic Policy Exploration