VeRO: An Evaluation Harness for Agents to Optimize Agents

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation methods for coding agents in agent optimization tasks, which involve iteratively improving a target agent through edit-execute-evaluate cycles. To this end, we propose VeRO, the first standardized evaluation framework for agent optimization, integrating versioned snapshots, structured execution traces, budget-constrained assessment, and a benchmark task suite. VeRO enables reliable comparison of optimization strategies and fine-grained analysis of intermediate reasoning processes. Empirical studies using this framework reveal significant performance variations across different optimizer configurations in multi-task settings. The project is open-sourced to advance research on agent optimization as a core capability of coding agents.

📝 Abstract
An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.
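The abstract describes agent optimization as an edit-execute-evaluate cycle run under a budget, with every version of the target agent snapshotted and every run producing a trace. The loop below is a minimal sketch of that cycle; all names in it (`Snapshot`, `propose_edit`, `run_tasks`, `optimize`) are hypothetical illustrations, not the VERO API.

```python
class Snapshot:
    """A versioned copy of the target agent's editable source."""
    def __init__(self, version, source):
        self.version = version
        self.source = source  # e.g. {filename: text}

def optimize(initial_source, propose_edit, run_tasks, budget):
    """Iteratively edit and re-evaluate a target agent under a step budget.

    propose_edit(snapshot, trace) -> new source (the optimizer's move)
    run_tasks(source) -> (score, trace): run the agent, capture a trace
    """
    history = [Snapshot(0, initial_source)]
    best_score, trace = run_tasks(initial_source)
    best = history[0]
    for step in range(1, budget + 1):
        candidate = propose_edit(best, trace)      # edit
        score, trace = run_tasks(candidate)        # execute + evaluate
        snap = Snapshot(step, candidate)
        history.append(snap)                       # every version kept for replay
        if score > best_score:                     # retain only improving edits
            best, best_score = snap, score
    return best, best_score, history
```

Because every snapshot and trace is retained, an evaluation harness in this style can replay any intermediate version and attribute score changes to specific edits, which is the kind of fine-grained analysis the paper targets.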
Problem

Research questions and friction points this paper is trying to address.

agent optimization
coding agents
evaluation harness
LLM completions
structured execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

agent optimization
evaluation harness
structured execution traces
versioned agent snapshots
benchmark suite