🤖 AI Summary
Evaluation of generative AI is hindered by the scarcity of high-quality benchmarks, whose manual construction is slow and costly.
Method: BENCHAGENTS, an automated benchmark construction framework powered by collaborative large language model (LLM) agents. It decomposes benchmark creation into four sequential stages (planning, generation, verification, and evaluation) and integrates task decomposition, agent coordination, human-in-the-loop developer feedback, and explicit constraint-satisfaction assessment.
Contribution/Results: The framework improves data diversity and metric reliability while keeping benchmark creation flexible and controllable. Using it, we construct benchmarks targeting planning and constraint-satisfaction capabilities in text generation and systematically evaluate seven state-of-the-art models, uncovering shared failure modes and fine-grained capability differences. This establishes a scalable, reproducible paradigm for evaluating new generative capabilities.
📝 Abstract
Evaluations are limited by benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities. However, creating new benchmarks through human annotations is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities while inherently ensuring data and metric quality. BENCHAGENTS decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback from benchmark developers to explicitly improve and flexibly control data diversity and quality. We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
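The four-stage agent pipeline described above can be sketched in code. This is a minimal illustrative outline, not the paper's implementation: all function and class names are hypothetical, and the LLM calls inside each agent are stubbed so that only the control flow (plan → generate → verify → evaluate, with optional developer feedback) is shown.

```python
# Hypothetical sketch of a BENCHAGENTS-style pipeline. Each stage would be
# backed by an LLM agent in practice; here the agents are stubbed so the
# orchestration itself is runnable.

from dataclasses import dataclass, field

@dataclass
class Benchmark:
    plan: list = field(default_factory=list)       # planned constraint specs
    instances: list = field(default_factory=list)  # generated test prompts
    metrics: dict = field(default_factory=dict)    # per-constraint checks

def planning_agent(capability, developer_feedback=None):
    # Decompose the target capability into constraint categories; the
    # benchmark developer can steer the plan via feedback (human-in-the-loop).
    plan = [f"{capability}: constraint type {i}" for i in range(3)]
    if developer_feedback:
        plan.append(f"{capability}: {developer_feedback}")
    return plan

def generation_agent(plan):
    # Generate one candidate benchmark instance per planned constraint spec.
    return [f"Write a text satisfying [{spec}]" for spec in plan]

def verification_agent(instances):
    # Verify data quality; stubbed here as simple de-duplication.
    seen = set()
    return [x for x in instances if not (x in seen or seen.add(x))]

def evaluation_agent(plan):
    # Attach an explicit pass/fail check per constraint for model outputs
    # (stubbed as a trivial non-emptiness check).
    return {spec: (lambda output: len(output) > 0) for spec in plan}

def build_benchmark(capability, developer_feedback=None):
    plan = planning_agent(capability, developer_feedback)
    instances = verification_agent(generation_agent(plan))
    return Benchmark(plan, instances, evaluation_agent(plan))

bench = build_benchmark("constraint satisfaction", "include length limits")
```

In this sketch each agent's output feeds the next, and the evaluation agent's per-constraint checks make constraint satisfaction explicitly measurable rather than relying on a single holistic score.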