ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess the capabilities of coding agents on real-world reasoning and optimization tasks. To address this gap, this work proposes ISO-Bench, a benchmark of 54 performance optimization tasks derived from merged pull requests in two widely used open-source LLM serving frameworks, vLLM and SGLang. Given a codebase and a bottleneck description, an agent must generate an optimization patch, which is then evaluated against the expert human solution. The study pairs hard metrics (execution-based correctness and speedup) with soft metrics (LLM-based scoring), revealing that while agents often correctly identify performance bottlenecks, they struggle to produce effective optimizations. The results further show that scaffolding critically influences agent performance: no single agent dominates across all tasks, and agents built on identical base models exhibit markedly different outcomes due to differences in scaffolding design.

📝 Abstract
We introduce ISO-Bench, a benchmark that tests the capabilities of coding agents on real-world inference optimization tasks. The tasks are drawn from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an agent with a codebase and a bottleneck description, and the agent must produce an optimization patch that is evaluated against the expert human solution. We curated 54 tasks from merged pull requests with measurable performance improvements. While existing benchmarks rely heavily on runtime-based metrics, such approaches can be gamed to pass tests without capturing the actual intent of the code changes. We therefore combine hard (execution-based) and soft (LLM-based) metrics, and show that both are necessary for a complete evaluation. Evaluating both closed- and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly, agents often identify the correct bottleneck but fail to produce a working solution. We also show that agents with identical underlying models differ substantially, suggesting that scaffolding is as important as the model.
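The abstract's combination of hard (execution-based) and soft (LLM-based) metrics can be illustrated with a minimal sketch. This is not the paper's actual scoring code; the field names, the gating rule, and the equal 50/50 weighting are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    tests_pass: bool    # hard metric: does the patched code still run correctly?
    speedup: float      # hard metric: baseline runtime / patched runtime
    judge_score: float  # soft metric: LLM-judge alignment with the expert patch, in [0, 1]


def evaluate(result: PatchResult, min_speedup: float = 1.0) -> float:
    """Combine hard and soft metrics into a single score in [0, 1].

    A patch that breaks correctness scores zero no matter how plausible
    it looks to the judge; otherwise the hard component checks whether
    the measured speedup clears a threshold, and the soft component
    captures whether the change matches the expert's intent.
    """
    if not result.tests_pass:
        return 0.0
    hard = 1.0 if result.speedup >= min_speedup else 0.0
    # Equal weighting of hard and soft components (an assumption, not
    # the weighting used by ISO-Bench itself).
    return 0.5 * hard + 0.5 * result.judge_score
```

The gating step reflects the abstract's point that runtime metrics alone can be gamed: a patch that "passes" by breaking semantics gets no credit, and a semantically faithful patch with no measured speedup is only partially rewarded.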
Problem

Research questions and friction points this paper is trying to address.

coding agents
inference optimization
benchmark
performance bottlenecks
code evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ISO-Bench
inference optimization
coding agents
benchmark evaluation
LLM-based metrics
Ayush Nangia
Shikhar Mishra
Aman Gokrani
Paras Chopra
Independent Researcher