SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical gap in current LLM-agent benchmarks: they predominantly assume preconfigured environments, neglecting agents’ ability to autonomously bootstrap execution environments—e.g., installing dependencies, initializing databases, and orchestrating multi-service deployments—in bare Linux sandboxes. To address this, we introduce SetupBench, the first systematic benchmark for environment bootstrapping, comprising 93 diverse tasks spanning seven programming languages, five database types, and multi-service coordination scenarios. SetupBench features a rigorous, quantifiable success criterion, a principled task taxonomy, and an automated sandbox framework that executes natural-language task specifications and verifies outcomes via deterministic shell commands. Evaluations on OpenHands reveal low overall success rates (20.0–53.3% for database configuration), with 38–89% of agent actions constituting redundant exploration—exposing fundamental weaknesses in persistent configuration management, dependency resolution, and environmental memory.
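The verification idea described above—running a deterministic shell command and treating its exit status as the pass/fail signal, with no LLM-based judging—can be sketched as follows. This is an illustrative harness only; the function name, interface, and command strings are assumptions, not SetupBench's published implementation.

```python
import subprocess

def verify_task(success_command: str, timeout: int = 60) -> bool:
    """Run a deterministic shell check inside the sandbox and report pass/fail.

    Hypothetical sketch: SetupBench's actual harness and command format
    are not specified here. The only criterion is the exit status, so the
    outcome is reproducible and requires no model-based judging.
    """
    result = subprocess.run(
        success_command,
        shell=True,
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode == 0

# Example: a task like "make Python 3 available" might be checked this way.
print(verify_task("python3 --version"))
```

Because success is a single exit code, the same check can gate both the agent's run and any later human re-run of the sandbox.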

📝 Abstract
Modern Large Language Model (LLM) agents promise end-to-end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pre-installed. To fill this gap, we introduce SetupBench, a 93-instance benchmark that isolates the environment-bootstrap skill: starting from a bare Linux sandbox, an agent must install packages, resolve dependency conflicts, initialize databases, and configure background services. Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios, each accompanied by a natural-language problem statement and a deterministic success command. Through evaluation of OpenHands, a state-of-the-art coding agent, we find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%). Our analysis reveals systematic failure modes including incomplete development tooling installation, hallucinated task constraints, and non-persistent environment modifications that break agent-human collaboration workflows. We identify substantial inefficiencies in agent exploration strategies, with 38-89% of actions being unnecessary compared to optimal human behavior. These findings highlight gaps in current agents' practical environment-bootstrap capabilities. By targeting this critical yet under-evaluated capability, SetupBench provides a rigorous yardstick for the next generation of software developer agents aiming to solve end-to-end real-world tasks.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM agents' ability to bootstrap development environments from scratch
Evaluating agents' skills in resolving dependencies and configuring services
Identifying inefficiencies and failure modes in environment setup tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SetupBench for environment-bootstrap evaluation
Tests agents in bare Linux sandbox scenarios
Evaluates multi-service orchestration and dependencies
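A benchmark instance as described above pairs a natural-language problem statement with a deterministic success command, organized under a task taxonomy (language ecosystem, database, multi-service orchestration). A minimal sketch of what such a task record could look like, with a field schema that is entirely illustrative (the paper does not publish one):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SetupTask:
    """One benchmark instance: a natural-language goal plus a deterministic check.

    Field names and values are hypothetical, chosen to mirror the task
    taxonomy the paper describes, not its actual data format.
    """
    task_id: str
    category: str           # e.g. "repo-setup", "database", "multi-service"
    ecosystem: str          # one of the seven language ecosystems
    problem_statement: str  # natural-language spec handed to the agent
    success_command: str    # shell command whose exit status decides pass/fail

# Illustrative instance for a local-database configuration task.
task = SetupTask(
    task_id="db-postgres-001",
    category="database",
    ecosystem="python",
    problem_statement="Initialize a PostgreSQL database named 'app' and create a user with access to it.",
    success_command="psql -d app -c 'SELECT 1;'",
)
print(task.category)
```

Keeping the success command inside the task record means the sandbox framework can score any agent transcript by re-running one line of shell, which is what makes the success criterion quantifiable.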