Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether adding agents improves large language model (LLM) workflows once all systems are evaluated under a unified protocol. It introduces BenchAgent, an evaluation framework that runs single-agent, fixed multi-agent, and evolving multi-agent workflows under identical conditions: the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging. Across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, at most one of the six tested multi-agent systems exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent falls within the Wilson one-run guidance, while the other five trail by 2.56 to 11.29 percentage points at higher cost. In a separately reported Protocol-Aligned External (PAE) GAIA study, a Claude-Code-style runtime-generated workflow reaches 66.72% overall accuracy, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed multi-agent system.

📝 Abstract

Does adding more agents help an LLM workflow once the compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent system (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal (SI) workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56–11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.
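
The abstract relies on two quantities, benchmark-balanced average accuracy and a Wilson-based guidance for single-run results. The sketch below shows one common way to compute each. It is an illustration under stated assumptions, not BenchAgent's implementation: the function names, the 95% level (z = 1.96), and the reading of "benchmark-balanced" as an unweighted mean of per-benchmark accuracies are all assumptions, and the example counts are made up.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 (about 95% coverage) is an assumed default, not a value
    taken from the paper.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


def balanced_accuracy(per_benchmark: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-benchmark accuracies.

    Each value is (correct, total). Every benchmark counts equally,
    regardless of how many tasks it contains (assumed interpretation).
    """
    if not per_benchmark:
        raise ValueError("need at least one benchmark")
    return sum(c / n for c, n in per_benchmark.values()) / len(per_benchmark)


# Hypothetical counts, for illustration only.
example = {"small_bench": (1, 2), "large_bench": (30, 40)}
print(balanced_accuracy(example))   # balanced: each benchmark weighted equally
print(wilson_interval(50, 100))     # interval around an observed 50% accuracy
```

With a single run per system, an interval of this kind gives a rough sense of how large an accuracy gap must be before it is distinguishable from sampling noise. Balancing across benchmarks keeps one large benchmark from dominating the average.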

Problem

Research questions and friction points this paper is trying to address.

LLM agents

multi-agent systems

workflow evaluation

benchmarking

agent collaboration

Innovation

Methods, ideas, or system contributions that make the work stand out.

BenchAgent

multi-agent systems

protocol-aligned evaluation