Reinforcement Mid-Training

πŸ“… 2025-09-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work identifies a critical gap between pretraining and post-training in large language models (LLMs): the absence of an effective intermediate phase, leading to redundant inference, imbalanced token entropy distributions, and suboptimal information utilization. To address this, we propose Reinforcement Mid-Training (RMT), a novel framework incorporating three key innovations: (1) a dynamic token budget mechanism to constrain excessive inference length; (2) a curriculum-based adaptive sampling strategy aligned with token-level entropy distribution; and (3) a dual-objective training paradigm jointly optimizing reinforcement learning rewards and next-token prediction. Experiments demonstrate that RMT improves language modeling performance by up to 64.91% while reducing average inference length to just 21% of the baseline. In mathematical reasoning post-training, it yields up to an 18.76% gain. This work establishes and empirically validates, for the first time, a systematic LLM mid-training paradigm grounded in reinforcement learning.


πŸ“ Abstract
The development of state-of-the-art large language models is commonly understood as a two-stage process involving pre-training and post-training. We point out the need for an additional intermediate stage called reinforcement mid-training with potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with various innovative components. In particular, we first introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens and full exploitation of all token information. Extensive experiments demonstrate the superiority of RMT over state-of-the-art methods, achieving up to +64.91% performance improvement with only 21% of the reasoning length in language modeling. We also show that checkpoints obtained after reinforcement mid-training can benefit the subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
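The paper's first component, the dynamic token budget, constrains reasoning length per prompt rather than globally. The paper does not publish its exact budgeting rule, so the following is only a minimal illustrative sketch, assuming a budget that scales with a hypothetical difficulty score in [0, 1] and is clamped to fixed bounds; all names and constants are invented for illustration.

```python
def dynamic_token_budget(base_budget, difficulty, min_budget=16, max_budget=256):
    """Illustrative dynamic token budget (not the paper's exact rule).

    Easy prompts (low difficulty) get a tight budget to curb overthinking;
    hard prompts are allowed longer reasoning chains.
    difficulty is a hypothetical per-prompt score in [0, 1].
    """
    # Scale the base budget between 25% and 100% with difficulty
    budget = int(base_budget * (0.25 + 0.75 * difficulty))
    # Clamp to hard lower/upper limits on reasoning length
    return max(min_budget, min(max_budget, budget))
```

Under this sketch, an easy prompt with `base_budget=200` receives only 50 tokens of reasoning, while a maximally hard one receives the full 200 (capped at 256).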
Problem

Research questions and friction points this paper is trying to address.

Optimizing language model training efficiency by reducing unnecessary reasoning steps
Addressing imbalanced token entropy distribution during model training
Improving token information utilization through dual training strategies
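The entropy imbalance above motivates the paper's curriculum-based adaptive sampling: tokens are ordered from easy (low next-token entropy) to hard (high entropy). The paper's exact schedule is not given here, so this is a hedged sketch assuming a linearly growing entropy threshold; the function names and the linear schedule are assumptions, not the authors' implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def curriculum_select(token_dists, step, total_steps):
    """Select token positions for training, easy (low-entropy) tokens first.

    The entropy threshold grows linearly with training progress, so
    high-entropy (hard) tokens are only admitted in later steps.
    This linear schedule is an illustrative assumption.
    """
    entropies = [token_entropy(p) for p in token_dists]
    max_h = max(entropies) or 1.0  # avoid a zero threshold if all entropies are 0
    threshold = max_h * (step + 1) / total_steps
    return [i for i, h in enumerate(entropies) if h <= threshold]
```

For example, with one peaked distribution (entropy β‰ˆ 0.17 nats) and one uniform distribution over four tokens (entropy β‰ˆ 1.39 nats), only the peaked token is selected at the first step, and both are selected by the final step.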
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic token budget mechanism reduces reasoning steps
Curriculum-based adaptive sampling for progressive token learning
Dual training combines reinforcement learning with next-token prediction
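The dual training strategy above can be sketched as a single scalar objective: a REINFORCE-style policy-gradient term applied only to key (e.g. high-entropy) tokens, interpolated with a standard next-token-prediction loss over all tokens. The paper does not specify this exact surrogate, so the formulation below, including the `alpha` interpolation weight and the per-token advantage inputs, is an assumption for illustration only.

```python
def dual_objective(logprobs, advantages, key_token_mask, alpha=0.5):
    """Illustrative dual loss: RL term on key tokens + NTP loss on all tokens.

    logprobs:       log-probability of each realized token under the model.
    advantages:     hypothetical per-token RL advantage estimates.
    key_token_mask: 1 for tokens targeted by the RL term, 0 otherwise.
    alpha:          interpolation weight between the two objectives (assumed).
    """
    # REINFORCE-style surrogate: push up advantage-weighted log-probs on key tokens
    rl_loss = -sum(a * lp * m for a, lp, m in zip(advantages, logprobs, key_token_mask))
    # Standard next-token prediction: negative log-likelihood over every token
    ntp_loss = -sum(logprobs)
    return alpha * rl_loss + (1 - alpha) * ntp_loss
```

The NTP term ensures every token contributes a learning signal, while the masked RL term concentrates reward-driven updates on the key tokens, matching the paper's stated aim of "targeted learning on key tokens and full exploitation of all token information."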