How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study

📅 2025-04-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Improving the performance of large language models (LLMs) on complex reasoning tasks—while maintaining efficiency and scalability—remains a central challenge in AI research. This paper introduces a difficulty-aware, multi-stage reinforcement learning (RL) paradigm, proposing the first cross-domain (mathematics + code) joint RL training framework. Our approach integrates quantitative difficulty estimation, dynamic curriculum scheduling, and multi-task hybrid reward modeling. By progressively exposing models to increasingly challenging tasks and applying hierarchical data filtering, we significantly enhance the generalization and reasoning capabilities of small-parameter models (1.5B). Specifically, our method achieves 42.3% accuracy on AIME-2024 and 89.5% on MATH-500—substantially outperforming same-scale baselines. To foster reproducibility and community advancement, we will open-source the curated dataset, establishing a new paradigm and benchmark resource for efficient, scalable reasoning research.

Technology Category

Application Category

📝 Abstract
Enhancing the reasoning capabilities of Large Language Models (LLMs) with efficiency and scalability remains a fundamental challenge in artificial intelligence research. This paper presents a rigorous experimental investigation into how difficulty-aware staged reinforcement learning (RL) strategies can substantially improve LLM reasoning performance. Through systematic analysis, we demonstrate that strategically selecting training data according to well-defined difficulty levels markedly enhances RL optimization. Moreover, we introduce a staged training methodology, progressively exposing models to increasingly challenging tasks, further amplifying reasoning capabilities. Our findings reveal significant cross-domain benefits when simultaneously training models on mathematical reasoning and code generation tasks. Notably, our proposed approach enables a 1.5B parameter model to achieve an accuracy of 42.3% on the AIME-2024 benchmark, 89.5% on the MATH-500 benchmark. These results underscore the efficacy of our method in advancing the reasoning proficiency of LLMs. We will open-source our datasets on GitHub and Hugging Face.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning with difficulty-aware RL
Improving RL optimization via staged training
Boosting cross-domain performance in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Difficulty-aware staged reinforcement learning strategy
Progressive training with increasing task difficulty
Cross-domain training on math and code tasks
Yunjie Ji
Yunjie Ji
Unknown affiliation
S
Sitong Zhao
a-m-team
Xiaoyu Tian
Xiaoyu Tian
Chinese University of Hong Kong
H
Haotian Wang
a-m-team
S
Shuaiting Chen
a-m-team
Y
Yiping Peng
a-m-team
H
Han Zhao
a-m-team
Xiangang Li
Xiangang Li
Unknown affiliation
speech recognitionnatural language processing