Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

📅 2026-06-10
🤖 AI Summary
This work addresses the low acceptance rates of multi-token prediction (MTP) during reinforcement learning (RL) training, which stem from fluctuations in policy entropy and limit rollout efficiency. The study systematically investigates MTP in the post-training of large language models and finds a negative linear relationship between the rise in entropy during RL and the MTP acceptance rate. To mitigate this, the authors propose two key innovations: probabilistic rejection sampling in place of greedy draft sampling, and an end-to-end total variation (TV) loss that directly optimizes the multi-step rejection-sampling acceptance rate. With MTP trained before RL using this loss and rejection sampling, acceptance rates stay consistent throughout RL, eliminating the need for online MTP updates. Integrated into an asynchronous RL framework, the method achieves up to a 95% MTP acceptance rate, up to 25% extra inference throughput, and up to 1.8× end-to-end training acceleration on Qwen3.5, Qwen3.6, and Qwen3.7.
📝 Abstract
Reinforcement learning (RL) has become a key component of modern large language models, yet the rollout stage remains the main bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural way to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, which limits the achievable speedup. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes for integrating MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy: it shows a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that, compared to greedy draft sampling, probabilistic rejection sampling largely alleviates the disturbance that entropy introduces in RL. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and we therefore propose a novel end-to-end TV loss that directly optimizes the multi-step rejection sampling acceptance rate. This yields ~10% acceptance rate improvements, reaching up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with the e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL run, eliminating the need for costly online MTP updates. We provide extensive experiments and analysis that validate our findings. Experimental results show that our method achieves up to 1.8× end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
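To make the link between rejection sampling and a TV objective concrete, here is a minimal single-token sketch in Python. It is not the paper's implementation. It uses the standard speculative-sampling rule (accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p - q, 0)), for which the expected acceptance probability is sum_x min(p(x), q(x)) = 1 - TV(p, q). The function names, the toy distributions, and the reading of "greedy draft sampling" as "propose argmax q and accept only if the target's sampled token matches" are assumptions made for illustration. The paper's multi-step, end-to-end loss is not reproduced here.

```python
import random

def acceptance_rate(p, q):
    """Expected acceptance probability of one rejection-sampling step."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def greedy_acceptance(p, q):
    """Acceptance probability when the draft proposes argmax q (assumed
    meaning of 'greedy draft'): the target must sample that same token."""
    k = max(range(len(q)), key=q.__getitem__)
    return p[k]

def speculative_step(p, q, rng):
    """One draft-and-verify step. Returns (token, accepted).

    The returned token is distributed according to the target p,
    whether or not the draft token was accepted.
    """
    x = rng.choices(range(len(q)), weights=q)[0]   # draft proposal x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):       # accept with min(1, p/q)
        return x, True
    # Rejected: resample from the residual distribution max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    y = rng.choices(range(len(p)), weights=residual)[0]
    return y, False

# Toy example (hypothetical numbers): a fairly high-entropy target p
# and a draft q that is close to it but not identical.
p = [0.4, 0.3, 0.2, 0.1]
q = [0.5, 0.25, 0.15, 0.1]

rs_rate = acceptance_rate(p, q)        # equals 1 - TV(p, q)
greedy_rate = greedy_acceptance(p, q)  # equals p[argmax q]
```

For these toy distributions the rejection-sampling acceptance rate is about 0.9 while the greedy-draft rate is 0.4, which illustrates why a high-entropy target penalizes greedy drafting much more than probabilistic rejection sampling. Because the single-step acceptance rate is exactly 1 - TV(p, q), minimizing TV between draft and target raises the acceptance rate directly, whereas cross-entropy or KL only do so indirectly.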
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Multi-Token Prediction
Rollout Acceleration
Acceptance Rate
Entropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Token Prediction
Rejection Sampling
Entropy Bound
Total Variation Loss
Reinforcement Learning Acceleration