🤖 AI Summary
Existing AI programming agent benchmarks (e.g., SWE-Bench) focus on isolated, single-step tasks and thus fail to assess the long-term software evolution capabilities essential to real-world engineering. To address this gap, we introduce SWE-EVO, the first benchmark explicitly designed for long-horizon software evolution. It comprises 48 multi-step tasks derived from authentic release notes and Git histories of seven open-source Python projects, each involving modifications across an average of 21 files. Tasks emphasize requirement comprehension, cross-file coordination, and functional preservation. We formally define and quantify long-horizon evolution capability, introducing the fine-grained *Fix Rate* metric. Automated validation leverages large-scale test suites (mean 874 tests per instance) and integrates seamlessly with agent frameworks such as OpenHands. Experiments show that GPT-5 with OpenHands achieves only 21% task success, far below its 65% on SWE-Bench Verified, revealing fundamental limitations in multi-step, multi-file reasoning. SWE-EVO establishes a new standard for evaluating the engineering-grade capabilities of AI programming agents.
📝 Abstract
Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
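The abstract does not spell out how Fix Rate is computed. A minimal sketch, assuming Fix Rate is the fraction of an instance's validation tests that pass after the agent's changes (the names `TestResult` and `fix_rate` below are illustrative, not the paper's API):

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    """Outcome of one validation test for a benchmark instance."""
    name: str
    passed: bool

def fix_rate(results: list[TestResult]) -> float:
    """Hypothetical Fix Rate: fraction of the instance's tests that pass.

    Unlike a binary resolved/unresolved score, this rewards partial
    progress on long-horizon tasks with large test suites.
    """
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)

# Example: an agent's patch passes 2 of an instance's 4 target tests.
results = [
    TestResult("test_parse", True),
    TestResult("test_merge", True),
    TestResult("test_cli", False),
    TestResult("test_api", False),
]
print(f"Fix Rate: {fix_rate(results):.2f}")  # prints "Fix Rate: 0.50"
```

Under this reading, a task counts as fully resolved only when Fix Rate reaches 1.0, while intermediate values quantify how far an agent got on a multi-step evolution task.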