Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

📅 2025-08-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Reinforcement learning (RL) for large language models (LLMs) has long been constrained to single-turn tasks (e.g., mathematical reasoning), limiting its applicability to stateful, multi-turn environments such as software engineering. Method: This paper introduces the first end-to-end RL training framework tailored to realistic, multi-turn software engineering interactions. Its core innovation is Decoupled Advantage Policy Optimization (DAPO), a teacher-free RL algorithm that jointly optimizes long-context modeling and state-aware environmental feedback. Contribution/Results: Instantiated with Qwen2.5-72B-Instruct, the framework achieves a 39% success rate on SWE-bench Verified (up from 20%) and matches or exceeds state-of-the-art open-source models (e.g., DeepSeek-V3-0324 and Qwen3-235B-A22B) on SWE-rebench. This work provides the first empirical validation of RL's feasibility and effectiveness in stateful, multi-turn software engineering tasks.

๐Ÿ“ Abstract
Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.
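The distinction the abstract draws, between a degenerate multi-turn MDP with no feedback and a stateful environment that answers every action with a non-trivial observation, can be sketched as a rollout loop. The toy environment, policy, and interfaces below are illustrative stand-ins, not the paper's actual SWE scaffolding:

```python
from dataclasses import dataclass

@dataclass
class ToyStatefulEnv:
    """Minimal stateful environment: each action mutates state and
    returns a non-trivial observation, unlike single-shot generation."""
    target: int = 3
    state: int = 0
    done: bool = False

    def step(self, action: str):
        if action == "submit":
            self.done = True
            # Terminal reward only, as with a passing/failing test suite.
            return f"state={self.state}", float(self.state == self.target)
        if action == "inc":
            self.state += 1
        return f"state={self.state}", 0.0  # intermediate steps: no reward

def rollout(env, policy, max_turns=8):
    """Collect a multi-turn trajectory: the policy conditions on the
    growing history of observations (long context), not just a prompt."""
    history, total_reward = [], 0.0
    obs = "state=0"
    for _ in range(max_turns):
        action = policy(history, obs)
        obs, reward = env.step(action)
        history.append((action, obs))
        total_reward += reward
        if env.done:
            break
    return history, total_reward

# A hand-written policy standing in for the LLM agent.
def scripted_policy(history, obs):
    return "inc" if len(history) < 3 else "submit"

traj, reward = rollout(ToyStatefulEnv(), scripted_policy)
```

In the single-turn regime the loop would terminate after one `step` with no informative observation; here the trajectory of (action, observation) pairs is exactly what the long-context agent must learn to exploit.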
Problem

Research questions and friction points this paper is trying to address.

Extends RL beyond single-turn tasks to multi-turn LLM interactions for software engineering
Improves success rate on SWE tasks without relying on teacher models
Matches or outperforms leading open-weight models under identical scaffolding
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL applied to multi-turn software engineering tasks
Modified DAPO algorithm for agent training
Qwen2.5-72B-Instruct as base model
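The commonly used DAPO recipe pairs group-relative advantages (reward normalized within a group of rollouts for the same task) with asymmetric "decoupled" clipping of the policy ratio. The sketch below illustrates that general idea only; the epsilon values and the paper's specific modifications to DAPO are assumptions, not taken from this work:

```python
import math

def group_advantages(rewards):
    """Group-relative advantage: normalize each sampled trajectory's
    reward against the mean/std of its group of G rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid div-by-zero on uniform groups
    return [(r - mean) / std for r in rewards]

def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Decoupled clipping: the upper clip bound is wider than the lower
    one, so low-probability tokens with positive advantage can grow
    faster than a symmetric PPO-style clip would allow."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

With binary task rewards (e.g. 1.0 if the patch passes the tests, else 0.0), `group_advantages([1.0, 0.0, 1.0, 0.0])` pushes probability toward the successful rollouts and away from the failed ones within each group.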
Authors
Alexander Golubev (Nebius AI), Maria Trofimova (Nebius AI), Sergei Polezhaev (Nebius AI), Ibragim Badertdinov (Nebius AI), Maksim Nekrashevich (Nebius AI), Anton Shevtsov (Nebius AI), Simon Karasik (Nebius AI), Sergey Abramov (Nebius AI), Andrei Andriushchenko (Nebius AI), Filipp Fisin (Nebius AI), Sergei Skvortsov (Nebius AI), Boris Yangel (Humanoid)