QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of efficiently and stably scaling large reasoning models (LRMs) to long-context reasoning tasks under reinforcement learning (RL), this paper formalizes long-context reasoning RL and introduces QwenLong-L1, a training framework built around progressive context scaling. The method comprises three core components: (1) a warm-up supervised fine-tuning stage that establishes a robust initial policy; (2) a curriculum-guided, phased RL strategy that gradually increases input length to mitigate training instability; and (3) difficulty-aware retrospective sampling that revisits hard examples to encourage exploration. Evaluated on seven long-document question-answering benchmarks, the resulting model, QwenLong-L1-32B, outperforms OpenAI-o3-mini and Qwen3-235B-A22B and achieves performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art long-context LRMs.

📝 Abstract
Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed on short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason over long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL and identify key challenges of suboptimal training efficiency and an unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize policy evolution, enhanced with a difficulty-aware retrospective sampling strategy to incentivize policy exploration. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs such as OpenAI-o3-mini and Qwen3-235B-A22B and achieves performance on par with Claude-3.7-Sonnet-Thinking, demonstrating leading performance among state-of-the-art LRMs. This work advances the development of practical long-context LRMs capable of robust reasoning across information-intensive environments.
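
The abstract outlines a staged recipe: an SFT warm-up, followed by RL phases that progressively admit longer inputs. The following is a minimal Python sketch of that flow under stated assumptions; the function names (sft_warmup, rl_stage, train_staged), the Example dataclass, and the context-length schedule are illustrative stand-ins, not the paper's actual API or hyperparameters.

```python
# Minimal sketch of the staged recipe described in the abstract (illustrative only):
# 1) SFT warm-up to establish a robust initial policy,
# 2) curriculum-guided RL phases with progressively longer admissible contexts.
# All names and the context-length schedule are assumptions, not the paper's API.
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Example:
    context: str    # long input document(s)
    question: str
    answer: str


def sft_warmup(policy, data: List[Example]):
    """Warm-up supervised fine-tuning on reference reasoning traces (stub)."""
    return policy


def rl_stage(policy, data: List[Example], max_context_tokens: int):
    """One curriculum phase of RL restricted to inputs up to max_context_tokens (stub)."""
    admitted = [ex for ex in data if len(ex.context.split()) <= max_context_tokens]
    # In practice: sample rollouts on the admitted examples, score them with a
    # verifiable or judge-based reward, and update the policy with a
    # PPO/GRPO-style objective.
    return policy


def train_staged(policy, sft_data, rl_data,
                 context_schedule: Sequence[int] = (20_000, 60_000)):
    """Progressive context scaling: each RL phase admits longer inputs than the last."""
    policy = sft_warmup(policy, sft_data)
    for max_tokens in context_schedule:
        policy = rl_stage(policy, rl_data, max_tokens)
    return policy
```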
Problem

Research questions and friction points this paper is trying to address.

Extending large reasoning models to process long-context inputs effectively
Addressing suboptimal training efficiency in long-context reasoning
Stabilizing optimization process for long-context reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive context scaling for long-context adaptation
Curriculum-guided phased RL for stable policy evolution
Difficulty-aware retrospective sampling for policy exploration (a sketch follows this list)
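
As a companion to the last bullet, here is a small, assumption-based sketch of what difficulty-aware retrospective sampling could look like: examples from earlier phases are re-drawn with probability proportional to their estimated difficulty (here, one minus their historical average reward). The weighting rule and helper names are hypothetical, not taken from the paper.

```python
# Assumption-based sketch of difficulty-aware retrospective sampling: carry hard
# examples from earlier phases into the current phase, weighting each by an
# estimated difficulty. The scoring rule and names are illustrative, not the
# paper's implementation.
import random
from typing import Dict, List


def difficulty(avg_reward: float) -> float:
    """Lower historical reward -> higher difficulty weight (floored to avoid zeros)."""
    return max(1.0 - avg_reward, 1e-6)


def retrospective_sample(
    prev_stage_rewards: Dict[str, float],  # example_id -> mean reward in earlier phases
    current_pool: List[str],               # example_ids scheduled for the current phase
    k: int,
) -> List[str]:
    """Mix k difficulty-weighted examples from earlier phases into the current pool."""
    ids = list(prev_stage_rewards)
    weights = [difficulty(prev_stage_rewards[i]) for i in ids]
    revisited = random.choices(ids, weights=weights, k=k)  # sampled with replacement
    return current_pool + revisited


if __name__ == "__main__":
    history = {"doc-qa-1": 0.9, "doc-qa-2": 0.2, "doc-qa-3": 0.4}
    pool = ["doc-qa-4", "doc-qa-5"]
    print(retrospective_sample(history, pool, k=2))  # harder examples appear more often
```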
Authors
Fanqi Wan (Sun Yat-sen University)
Weizhou Shen (Tongyi Lab, Alibaba Group)
Shengyi Liao (Qwen-Doc Team, Alibaba Group)
Yingcheng Shi (Qwen-Doc Team, Alibaba Group)
Chenliang Li (Qwen-Doc Team, Alibaba Group)
Ziyi Yang (Qwen-Doc Team, Alibaba Group)
Ji Zhang (Qwen-Doc Team, Alibaba Group)
Fei Huang (Qwen-Doc Team, Alibaba Group)
Jingren Zhou (Alibaba Group)
Ming Yan (Qwen-Doc Team, Alibaba Group)