DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the frequent failure of research code deployment caused by complex environment configurations, heterogeneous toolchains, system dependencies (e.g., GPU/CUDA), and legacy compatibility issues, challenges that existing benchmarks capture inadequately. To bridge this gap, the paper introduces DeployBench, a multi-domain deployment benchmark of 51 research artifacts across AI/ML, computer systems, and scientific computing. DeployBench evaluates the autonomous deployment capabilities of LLM agents through hidden validation pipelines that run each paper's designated experiment and verify its outputs. The evaluation pairs the OpenHands framework with four state-of-the-art LLMs and executes end-to-end deployment and validation in full system environments. Pass rates range from 7.8% for the weakest model to 51.0% for the strongest. About 63% of failures (97 of 154) are agent-terminated self-stops in which the agent’s pre-finish checks validate a different or weaker target than the task requires, exposing critical deficiencies in agents’ judgment of task completion.

📝 Abstract

LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working, runnable environment. For research artifacts released alongside published papers, setting up such an environment on a fresh machine remains a major bottleneck. Existing environment setup benchmarks do not cover the full scope of research artifact deployment, which involves multi-language toolchains, system-level dependencies beyond containers (e.g., GPU/CUDA and kernel configurations), and legacy artifact compatibility. We introduce DeployBench, a multi-domain benchmark of 51 research-artifact deployment tasks spanning AI/ML, computer systems, and scientific computing, covering all these dimensions. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state-of-the-art LLMs with OpenHands yields pass rates ranging from 7.8% to 51.0%. Failures are dominated by a completion-judgment problem: 97 of 154 failures are agent-terminated self-stops, where the agent's pre-finish checks validate a different or weaker target than the paper-specific task requires. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.
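The headline numbers in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the per-model task counts are not stated in the abstract and are inferred here by searching for the integer counts out of 51 that round to the reported percentages. The implied total of passing runs further assumes one run per model per task and that the 154 failures count every failed run; neither assumption is confirmed by the source.

```python
# Figures taken directly from the abstract.
TOTAL_TASKS = 51
NUM_MODELS = 4
TOTAL_FAILURES = 154
SELF_STOP_FAILURES = 97


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def tasks_matching(rate_pct: float, total: int = TOTAL_TASKS) -> list[int]:
    """All task counts whose rounded pass rate equals the reported value."""
    return [k for k in range(total + 1) if pass_rate(k, total) == rate_pct]


# Inferred (not stated in the source): task counts behind 7.8% and 51.0%.
lowest = tasks_matching(7.8)
highest = tasks_matching(51.0)

# Share of failures that are agent-terminated self-stops, in whole percent.
self_stop_share = round(100 * SELF_STOP_FAILURES / TOTAL_FAILURES)

# Assumption: one run per (model, task) pair, so 4 * 51 runs in total.
implied_passes = NUM_MODELS * TOTAL_TASKS - TOTAL_FAILURES

print(lowest, highest, self_stop_share, implied_passes)
```

Under these assumptions, the reported range corresponds to a single consistent pair of task counts, and the self-stop share agrees with the roughly 63% figure quoted in the AI summary.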

Problem

Research questions and friction points this paper is trying to address.

LLM agents

research artifact deployment

environment setup

benchmark

scientific computing

Innovation

Methods, ideas, or system contributions that make the work stand out.

DeployBench

LLM agents

research artifact deployment