ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess the capabilities of coding agents on real-world reasoning and optimization tasks. To address this gap, this work proposes ISO-Bench, a benchmark of 54 performance optimization tasks derived from merged pull requests in two widely used open-source LLM serving frameworks, vLLM and SGLang. Given a codebase and a bottleneck description, an agent must generate an optimization patch, which is then evaluated against the expert human solution. The study pairs hard metrics (execution-based correctness and speedup) with soft metrics (LLM-based scoring), revealing that while agents often correctly identify performance bottlenecks, they struggle to produce effective optimizations. The results further show that scaffolding critically influences agent performance: no single agent dominates across all tasks, and agents built on identical base models exhibit markedly different outcomes due to differences in scaffolding design.

📝 Abstract
We introduce ISO-Bench, a benchmark that tests the capabilities of coding agents on real-world inference optimization tasks. The tasks are drawn from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an agent with a codebase and a bottleneck description, and the agent must produce an optimization patch that is evaluated against the expert human solution. We curated 54 tasks from merged pull requests with measurable performance improvements. While existing benchmarks rely heavily on runtime-based metrics, such approaches can be gamed to pass tests without capturing the actual intent of the code changes. We therefore combine hard (execution-based) and soft (LLM-based) metrics, and show that both are necessary for a complete evaluation. Evaluating both closed- and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly, agents often identify the correct bottleneck but fail to produce a working solution. We also show that agents with identical underlying models differ substantially, suggesting that scaffolding is as important as the model.
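The abstract's combination of hard (execution-based) and soft (LLM-based) metrics can be illustrated with a minimal sketch. This is not the paper's actual scoring code; the field names, the gating rule, and the equal 50/50 weighting are all assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    tests_pass: bool    # hard metric: does the patched code still run correctly?
    speedup: float      # hard metric: baseline runtime / patched runtime
    judge_score: float  # soft metric: LLM-judge alignment with the expert patch, in [0, 1]


def evaluate(result: PatchResult, min_speedup: float = 1.0) -> float:
    """Combine hard and soft metrics into a single score in [0, 1].

    A patch that breaks correctness scores zero no matter how plausible
    it looks to the judge; otherwise the hard component checks whether
    the measured speedup clears a threshold, and the soft component
    captures whether the change matches the expert's intent.
    """
    if not result.tests_pass:
        return 0.0
    hard = 1.0 if result.speedup >= min_speedup else 0.0
    # Equal weighting of hard and soft components (an assumption, not
    # the weighting used by ISO-Bench itself).
    return 0.5 * hard + 0.5 * result.judge_score
```

The gating step reflects the abstract's point that runtime metrics alone can be gamed: a patch that "passes" by breaking semantics gets no credit, and a semantically faithful patch with no measured speedup is only partially rewarded.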
Problem

Research questions and friction points this paper is trying to address.

coding agents
inference optimization
benchmark
performance bottlenecks
code evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

ISO-Bench
inference optimization
coding agents
benchmark evaluation
LLM-based metrics
Ayush Nangia
Shikhar Mishra
Aman Gokrani
Paras Chopra
Independent Researcher