🤖 AI Summary
This work addresses the instability and potential collapse of large language models during reinforcement learning training, often caused by discrepancies between training and inference environments. To mitigate this issue, the authors propose the Discrepancy-Constrained Markov Decision Process (DCMDP), which introduces a discrepancy tolerance region and employs a Lagrangian relaxation mechanism to dynamically balance reward maximization with behavioral alignment. By integrating black-box discrepancy modeling and heterogeneous alignment techniques, DCMDP establishes the first training paradigm capable of adaptively modulating exploration and alignment based on the degree of environmental discrepancy. Empirical results demonstrate that this approach significantly enhances the performance of both Qwen-3-8B and Qwen-3-30B-A3B models, achieving high-fidelity training while maintaining low-cost inference deployment.
📝 Abstract
Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable suboptimal performance or even training collapse. Recent findings attribute these failures to a hidden train-inference discrepancy (or mismatch) that stems from the disparate underlying engines and architectures. We find that the training policy can actively self-correct such a discrepancy when provided with an appropriate learning signal. We further empirically identify a discrepancy tolerance region: within this region, aggressively narrowing the discrepancy can suppress policy exploration and reduce learning efficiency, whereas outside this region, reducing excessive discrepancy improves optimization consistency and raises the achievable local performance ceiling. Based on these findings, we formulate the problem as a Discrepancy-Constrained Markov Decision Process (DCMDP), in which reward maximization is coupled with a constraint that aligns training and inference behavior. To adaptively balance performance improvement and discrepancy control, we introduce a Lagrangian relaxation mechanism that dynamically adjusts the relative weight of the two objectives according to the current degree of discrepancy violation. This enables stable dual-objective optimization: the policy is allowed to explore freely within the tolerance region, while being guided back when the discrepancy exceeds the safe boundary. Empirically, DCMDP significantly improves the performance of an 8B dense model (Qwen-3-8B) and a 30B Mixture-of-Experts model (Qwen-3-30B-A3B), and enables a heterogeneous training paradigm in which LLMs can be optimized in a high-fidelity training setup while being explicitly aligned for low-cost, resource-constrained inference deployment.
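The Lagrangian relaxation described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the discrepancy measure, the tolerance value, or the multiplier update rule, so everything below is an assumption: the class name `LagrangianBalancer`, the parameters `tolerance` and `lr`, and the projected dual-ascent update (a standard choice in constrained RL) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class LagrangianBalancer:
    """Hypothetical helper that weights a discrepancy constraint against reward."""

    tolerance: float   # assumed width of the discrepancy tolerance region
    lr: float = 0.1    # assumed step size for the multiplier update
    lam: float = 0.0   # Lagrange multiplier, kept non-negative

    def update(self, discrepancy: float) -> float:
        """Projected dual ascent on the multiplier.

        The multiplier grows while the measured discrepancy exceeds the
        tolerance and shrinks (never below zero) while it stays inside.
        """
        violation = discrepancy - self.tolerance
        self.lam = max(0.0, self.lam + self.lr * violation)
        return self.lam

    def loss(self, reward: float, discrepancy: float) -> float:
        """Relaxed objective to minimise: -reward + lam * (discrepancy - tolerance)."""
        return -reward + self.lam * (discrepancy - self.tolerance)


def run_demo() -> list[float]:
    """Feed a made-up discrepancy trace and collect the multiplier values."""
    balancer = LagrangianBalancer(tolerance=0.05, lr=0.5)
    trace = [0.02, 0.03, 0.20, 0.30, 0.04]  # illustrative values only
    return [balancer.update(d) for d in trace]
```

In this sketch the multiplier starts at zero and stays there while the discrepancy remains inside the tolerance region, so the policy is driven by reward alone. The penalty weight only rises once the boundary is crossed, and it decays again after the discrepancy returns to the region, which mirrors the "explore freely, then guide back" behavior the abstract describes.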