SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing AI agent benchmarks primarily focus on short-duration tasks, making them inadequate for evaluating agents’ long-horizon planning, extended-context comprehension, and memory capabilities in software engineering. This work proposes SWE-Marathon—the first benchmark specifically designed for ultra-long-horizon software engineering tasks spanning multiple hours and involving tens of millions of tokens. It comprises 20 tasks, each equipped with an executable environment, human-authored reference solutions, and multi-layered validation mechanisms. Through adversarial testing, shortcut-prevention designs, and trajectory analysis, the study systematically uncovers critical deficiencies in state-of-the-art coding agents, particularly in self-verification, task persistence, and resistance to reward hacking—evidenced by a completion rate below 30% and reward hacking observed in 13.8% of trials. The benchmark, evaluation code, and agent trajectories are publicly released to advance research on long-horizon autonomous agents.

📝 Abstract

AI agents are increasingly expected to complete long-horizon workflows that require sustained progress over hours, millions of tokens, and complex environments. Yet current agent benchmarks largely evaluate short-form tasks, such as single pull requests, small tickets, or 5-10 minute exercises, limiting our ability to measure agents' capabilities in planning, long-context understanding, and memory use. We introduce SWE-Marathon, a benchmark of 20 long-horizon tasks spanning software engineering and adjacent technical domains. Each task consists of a unique executable environment, a human-written reference solution, and a multi-layer verification suite. Logged agent attempts average 27.2M total tokens, making SWE-Marathon substantially longer-horizon than existing SWE and command-line agent benchmarks. Current frontier coding agents solve fewer than 30% of tasks. Failures often arise from poor self-verification, self-reported infeasibility, and premature termination. We also observe reward-hacking behavior in 13.8% of rollouts, where agents attempt to exploit the environment or verifier to bypass the intended workflow. SWE-Marathon includes adversarial review of test suites and execution environments, as well as multi-layer checks designed to prevent shortcut solutions. We release SWE-Marathon, evaluation code, and agent trajectories at https://swe-marathon.org/.

Problem

Research questions and friction points this paper is trying to address.

long-horizon tasks

AI agents

software engineering

benchmarking

autonomous workflow

Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon tasks

AI agents

software engineering benchmark