Learning to Reason with Mixture of Tokens

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning with verifiable rewards (RLVR) methods for large language model reasoning sample only discrete tokens at each step, discarding the rich information in the model's output probability distribution; this restricts the reasoning search space and makes exploration inefficient. This paper presents a unified Mixture-of-Tokens Generation (MoT-G) framework that generalizes existing training-free mixture-of-token approaches and extends RLVR-based chain-of-thought generation to operate directly in a continuous mixture-token space. MoT-G constructs mixture embeddings as probability-weighted sums over token embeddings, promoting distribution-aware exploration and maintaining higher hidden-state entropy throughout reasoning. Evaluated on Reasoning-Gym with Qwen2.5-1.5B, MoT-G variants achieve 5–35% accuracy gains over standard decoding on 7 out of 10 tasks, while matching baseline accuracy with half the training trajectories, indicating improved training efficiency.

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a leading approach for improving large language model (LLM) reasoning capabilities. Most current methods follow variants of Group Relative Policy Optimization, which samples multiple reasoning completions, scores them relative to each other, and adjusts the policy accordingly. However, these approaches invariably sample discrete tokens at each reasoning step, discarding the rich distributional information in the model's probability distribution over candidate tokens. While preserving and utilizing this distributional information has proven beneficial in non-RL settings, current RLVR methods seem to be unnecessarily constraining the reasoning search space by not using this information. To address this limitation, we investigate mixture-of-token generation (MoT-G) in RLVR. We present a unified framework that generalizes existing MoT-G approaches, including existing training-free methods that construct mixture embeddings as weighted sums over token embeddings, and extend RLVR to operate directly in this continuous mixture space for generating chain-of-thought. Evaluating two MoT-G variants on Reasoning-Gym, a suite of reasoning-intensive language tasks, we find that MoT-G methods achieve substantial improvements (5–35% gains on 7 out of 10 tasks) compared to standard decoding with the Qwen2.5-1.5B model, while reaching comparable accuracy with half the number of trajectories, suggesting improved training efficiency. Through comprehensive hidden-state and token-level analyses, we provide evidence that MoT-G's benefits may stem from its ability to maintain higher hidden-state entropy throughout the reasoning process and promote exploration in token space.
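The core idea, per the abstract, is to replace the discrete-token step with a probability-weighted sum over token embeddings. A minimal NumPy sketch of the contrast, using a toy vocabulary and embedding table (all dimensions and names here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

# Toy setup: vocabulary of 6 tokens, embedding dimension 4 (illustrative only).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(6, 4))  # one row per vocabulary token

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def discrete_step(logits):
    """Standard decoding: commit to one token, look up its embedding.
    All other candidates' probability mass is discarded."""
    return embedding_table[int(np.argmax(logits))]

def mixture_step(logits, temperature=1.0):
    """Mixture-of-token step: feed back a probability-weighted sum of ALL
    token embeddings, preserving the distributional information."""
    weights = softmax(logits / temperature)  # (vocab,) convex weights
    return weights @ embedding_table         # (embed_dim,) mixture embedding

logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
hard = discrete_step(logits)   # a single token's embedding
mix = mixture_step(logits)     # a point inside the convex hull of embeddings
```

Because the mixture embedding is a convex combination, it stays in the span of the embedding table while carrying entropy from the full distribution, which is the property the paper links to improved exploration.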
Problem

Research questions and friction points this paper is trying to address.

RLVR methods discard distributional token information during reasoning
Current approaches constrain reasoning search space unnecessarily
Need to preserve continuous token distributions in chain-of-thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-token generation in RLVR framework
Continuous mixture space for chain-of-thought reasoning
Weighted token embedding sums preserve distributional information
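The abstract notes that current RLVR methods follow variants of Group Relative Policy Optimization, which scores sampled completions relative to each other. A common form of that relative scoring is within-group reward normalization; the sketch below illustrates that general idea (the function name and epsilon are assumptions, not taken from the paper):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style relative scoring: normalize each completion's verifiable
    reward against the mean and spread of its sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled reasoning completions for one prompt, binary verifiable reward
# (e.g. answer correct / incorrect).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative, so the policy update pushes probability toward the better completions within each group.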