SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

📅 2025-05-26

🤖 AI Summary

LLM-based software engineering (SWE) agents face two bottlenecks: high-quality, realistic interactive training data is scarce, and static evaluation benchmarks quickly become outdated because of data contamination. This paper introduces an automated, scalable pipeline that continuously extracts real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, the authors release SWE-rebench, a public dataset of over 21,000 interactive Python-based tasks suitable for reinforcement learning of SWE agents at scale, and use the continuous supply of fresh tasks to build a contamination-free benchmark for agentic software engineering. Comparing various LLMs on this benchmark against their results on SWE-bench Verified suggests that the performance of some models may be inflated by contamination, which supports the case for time-sensitive, contamination-aware benchmarks.

📝 Abstract

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt their behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use the continuous supply of fresh tasks collected using the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark to their results on SWE-bench Verified and show that the performance of some language models might be inflated due to contamination issues.

Problem

Research questions and friction points this paper is trying to address.

Lack of high-quality real-world SWE training data

Shortage of fresh interactive tasks for model evaluation

Contamination issues in static SWE benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for GitHub task extraction

SWE-rebench: over 21,000 interactive Python tasks

Contamination-free benchmark for SWE agents
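
One way to picture how a continuous supply of fresh tasks can support a contamination-free benchmark is a date filter: evaluate a model only on tasks created after its training cutoff. The sketch below is illustrative only. The abstract does not specify how the benchmark selects tasks, so the `Task` fields, the `fresh_tasks` helper, the sample task IDs, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; field names are assumptions, not the dataset schema.
    task_id: str
    repo: str
    created: date  # assumed: date the underlying issue or pull request was created


def fresh_tasks(tasks, model_cutoff):
    """Keep only tasks created strictly after the model's training cutoff."""
    return [t for t in tasks if t.created > model_cutoff]


# Made-up sample data for illustration.
tasks = [
    Task("a-1", "org/a", date(2024, 3, 1)),
    Task("b-7", "org/b", date(2025, 2, 10)),
    Task("c-3", "org/c", date(2025, 4, 5)),
]
cutoff = date(2024, 12, 31)  # assumed training cutoff for some model

eval_set = fresh_tasks(tasks, cutoff)
print([t.task_id for t in eval_set])
```

Under this assumption, a task that predates the cutoff (here `a-1`) is excluded because the model could have seen its fix during training, while newer tasks remain eligible for evaluation.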
