SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical gap in current LLM-agent benchmarks: they predominantly assume preconfigured environments, neglecting agents’ ability to autonomously bootstrap execution environments—e.g., installing dependencies, initializing databases, and orchestrating multi-service deployments—in bare Linux sandboxes. To address this, we introduce SetupBench, the first systematic benchmark for environment bootstrapping, comprising 93 diverse tasks spanning seven programming languages, five database types, and multi-service coordination scenarios. SetupBench features a rigorous, quantifiable success criterion, a principled task taxonomy, and an automated sandbox framework that executes natural-language task specifications and verifies outcomes via deterministic shell commands. Evaluations on OpenHands reveal low overall success rates (20.0–53.3% for database configuration), with 38–89% of agent actions constituting redundant exploration—exposing fundamental weaknesses in persistent configuration management, dependency resolution, and environmental memory.
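The verification idea described above—running a deterministic shell command and treating its exit status as the pass/fail signal, with no LLM-based judging—can be sketched as follows. This is an illustrative harness only; the function name, interface, and command strings are assumptions, not SetupBench's published implementation.

```python
import subprocess

def verify_task(success_command: str, timeout: int = 60) -> bool:
    """Run a deterministic shell check inside the sandbox and report pass/fail.

    Hypothetical sketch: SetupBench's actual harness and command format
    are not specified here. The only criterion is the exit status, so the
    outcome is reproducible and requires no model-based judging.
    """
    result = subprocess.run(
        success_command,
        shell=True,
        capture_output=True,
        timeout=timeout,
    )
    return result.returncode == 0

# Example: a task like "make Python 3 available" might be checked this way.
print(verify_task("python3 --version"))
```

Because success is a single exit code, the same check can gate both the agent's run and any later human re-run of the sandbox.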

📝 Abstract
Modern Large Language Model (LLM) agents promise end-to-end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pre-installed. To fill this gap, we introduce SetupBench, a 93-instance benchmark that isolates the environment-bootstrap skill: starting from a bare Linux sandbox, an agent must install packages, resolve dependency conflicts, initialize databases, and configure background services. Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios, each accompanied by a natural-language problem statement and a deterministic success command. Through evaluation of OpenHands, a state-of-the-art coding agent, we find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%). Our analysis reveals systematic failure modes including incomplete development tooling installation, hallucinated task constraints, and non-persistent environment modifications that break agent-human collaboration workflows. We identify substantial inefficiencies in agent exploration strategies, with 38-89% of actions being unnecessary compared to optimal human behavior. These findings highlight gaps in current agents' practical environment-bootstrap capabilities. By targeting this critical yet under-evaluated capability, SetupBench provides a rigorous yardstick for the next generation of software developer agents aiming to solve end-to-end real-world tasks.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM agents' ability to bootstrap development environments from scratch
Evaluating agents' skills in resolving dependencies and configuring services
Identifying inefficiencies and failure modes in environment setup tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SetupBench for environment-bootstrap evaluation
Tests agents in bare Linux sandbox scenarios
Evaluates multi-service orchestration and dependencies
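A benchmark instance as described above pairs a natural-language problem statement with a deterministic success command, organized under a task taxonomy (language ecosystem, database, multi-service orchestration). A minimal sketch of what such a task record could look like, with a field schema that is entirely illustrative (the paper does not publish one):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SetupTask:
    """One benchmark instance: a natural-language goal plus a deterministic check.

    Field names and values are hypothetical, chosen to mirror the task
    taxonomy the paper describes, not its actual data format.
    """
    task_id: str
    category: str           # e.g. "repo-setup", "database", "multi-service"
    ecosystem: str          # one of the seven language ecosystems
    problem_statement: str  # natural-language spec handed to the agent
    success_command: str    # shell command whose exit status decides pass/fail

# Illustrative instance for a local-database configuration task.
task = SetupTask(
    task_id="db-postgres-001",
    category="database",
    ecosystem="python",
    problem_statement="Initialize a PostgreSQL database named 'app' and create a user with access to it.",
    success_command="psql -d app -c 'SELECT 1;'",
)
print(task.category)
```

Keeping the success command inside the task record means the sandbox framework can score any agent transcript by re-running one line of shell, which is what makes the success criterion quantifiable.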