🤖 AI Summary
Existing online Reinforcement Learning with Verifiable Rewards (RLVR) methods apply a uniform training strategy to all samples, neglecting the dynamic alignment between problem difficulty and model capability. This leads to redundant computation on already-mastered problems and insufficient guidance on high-difficulty samples.
Method: We propose a dynamic training framework integrating curriculum learning and policy optimization. It establishes an online difficulty assessment mechanism grounded in the model’s own rollout performance, enabling adaptive problem difficulty reconstruction and closed-loop evolution of training content. By synergizing verifiable rewards, dynamic difficulty evaluation, and adaptive problem recombination, the framework empowers the model to autonomously regulate its learning trajectory.
Contribution/Results: Evaluated on eight mathematical and general reasoning benchmarks, our approach achieves an average +6.96% improvement in pass@1, significantly enhancing both training efficiency and the upper bound of reasoning performance.
📝 Abstract
Recently, online Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically treat all training samples uniformly, overlooking the vast differences in problem difficulty relative to the model's current capabilities. This uniform training strategy leads to inefficient exploration of problems the model has already mastered, while lacking effective guidance on the problems that most challenge its abilities, limiting both learning efficiency and upper-bound performance. To address this, we propose CLPO (Curriculum-guided Learning for Policy Optimization), a novel algorithm that creates a dynamic pedagogical feedback loop within the policy optimization process. At its core, CLPO leverages the model's own rollout performance to conduct real-time difficulty assessment, thereby constructing an Online Curriculum. This curriculum then guides an Adaptive Problem Restructuring mechanism, where the model acts as its own teacher: it diversifies medium-difficulty problems to promote generalization and simplifies challenging problems to make them more attainable. Our approach transforms the static training procedure into a dynamic process that co-evolves with the model's capabilities. Experiments show that CLPO achieves state-of-the-art performance across eight challenging mathematical and general reasoning benchmarks, with an average pass@1 improvement of 6.96% over other methods, demonstrating its potential for training more capable reasoning models more efficiently.
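The assess-then-restructure loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the thresholds, the `Rewriter` interface, and the rollout-reward representation are all assumptions made for the sketch; CLPO's actual difficulty bands and restructuring prompts are not specified here.

```python
def assess_difficulty(rewards, easy_thresh=0.8, hard_thresh=0.2):
    """Classify a problem by the model's own rollout pass rate.

    `rewards` is a list of verifiable 0/1 outcomes from sampled rollouts.
    Thresholds are hypothetical values chosen for illustration.
    """
    rate = sum(rewards) / len(rewards)
    if rate >= easy_thresh:
        return "easy"    # mostly solved: little learning signal remains
    if rate <= hard_thresh:
        return "hard"    # rarely solved: candidate for simplification
    return "medium"      # partially solved: candidate for diversification


def restructure(problem, label, rewriter):
    """Route a problem through the online curriculum.

    `rewriter` stands in for the model acting as its own teacher
    (e.g. prompted to generate variants or easier sub-problems).
    """
    if label == "medium":
        return rewriter.diversify(problem)  # variants to promote generalization
    if label == "hard":
        return rewriter.simplify(problem)   # easier version, more attainable
    return problem                          # easy: pass through unchanged
```

A training loop would periodically re-run `assess_difficulty` on fresh rollouts, so the curriculum co-evolves with the policy: a problem labeled "hard" early in training can later migrate to "medium" and start being diversified instead of simplified.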