🤖 AI Summary
Existing AI programming agent benchmarks (e.g., SWE-Bench) focus on isolated, single-step tasks and thus fail to assess the long-term software evolution capabilities essential to real-world engineering. To address this gap, we introduce SWE-EVO, the first benchmark explicitly designed for long-horizon software evolution. It comprises 48 multi-step tasks derived from authentic release notes and Git histories of seven open-source Python projects, each involving modifications across an average of 21 files. Tasks emphasize requirement comprehension, cross-file coordination, and functional preservation. We formally define and quantify long-horizon evolution capability, introducing the fine-grained *Fix Rate* metric. Automated validation leverages large-scale test suites (mean 874 tests per instance) and integrates seamlessly with agent frameworks such as OpenHands. Experiments show that GPT-5 with OpenHands achieves only 21% task success, far below its 65% on SWE-Bench Verified, revealing fundamental limitations in multi-step, multi-file reasoning. SWE-EVO establishes a new standard for evaluating the engineering-grade capabilities of AI programming agents.
📝 Abstract
Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
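The abstract does not spell out how Fix Rate is computed. A minimal sketch, assuming Fix Rate is the fraction of an instance's validation tests that pass after the agent's changes (the names `TestResult` and `fix_rate` below are illustrative, not the paper's API):

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    """Outcome of one validation test for a benchmark instance."""
    name: str
    passed: bool

def fix_rate(results: list[TestResult]) -> float:
    """Hypothetical Fix Rate: fraction of the instance's tests that pass.

    Unlike a binary resolved/unresolved score, this rewards partial
    progress on long-horizon tasks with large test suites.
    """
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)

# Example: an agent's patch passes 2 of an instance's 4 target tests.
results = [
    TestResult("test_parse", True),
    TestResult("test_merge", True),
    TestResult("test_cli", False),
    TestResult("test_api", False),
]
print(f"Fix Rate: {fix_rate(results):.2f}")  # prints "Fix Rate: 0.50"
```

Under this reading, a task counts as fully resolved only when Fix Rate reaches 1.0, while intermediate values quantify how far an agent got on a multi-step evolution task.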