🤖 AI Summary
This study addresses the prevailing assumption that multi-agent systems (MAS) inherently outperform single-agent systems (SAS), despite a lack of rigorous benchmarks validating their purported advantages in context protection and parallel processing. Through systematic comparison between automatically generated MAS and Chain-of-Thought with Self-Consistency (CoT-SC) on both traditional reasoning and interactive multi-step tasks, we find that auto-generated MAS consistently underperform while costing up to ten times as much to run. To diagnose the core capabilities of MAS, we introduce a synthetic diagnostic dataset built around explicit task decomposition, context separation and parallelization potential. On this dataset, expert-designed MAS outperform automatically generated ones in both performance and cost-efficiency, and a systematic deconstruction of the generated architectures reveals severe architectural bloat, challenging the dominant paradigms in automated MAS design.
📝 Abstract
Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS, which are designed to generalize better than manually designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS, featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperform automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat: superficial complexity that does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
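For readers unfamiliar with the single-agent baseline named above, the following is a minimal sketch of self-consistency voting: several chain-of-thought answers are sampled independently and the most frequent final answer is returned. It is illustrative only and not the authors' implementation. The function names, the `sample_answer` callable (a stand-in for one model call that returns a final answer), the lowercase/strip normalization, and the tie-breaking rule are all assumptions.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among sampled reasoning chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Assumed normalization: trim whitespace and ignore case before counting.
    normalized = [a.strip().lower() for a in answers]
    # Counter.most_common keeps first-seen order among equal counts,
    # so a tie is resolved in favor of the earliest sample.
    return Counter(normalized).most_common(1)[0][0]


def cot_sc(sample_answer: Callable[[str], str], question: str, n_samples: int = 5) -> str:
    """Draw n_samples independent answers (hypothetical sampler) and majority-vote them."""
    return self_consistency_vote([sample_answer(question) for _ in range(n_samples)])


if __name__ == "__main__":
    # Stub sampler that replays canned answers instead of calling a model.
    canned = iter(["42", "41", "42 ", "42", "17"])
    print(cot_sc(lambda q: next(canned), "toy question", n_samples=5))  # prints 42
```

The cost of this baseline grows linearly with `n_samples`, which gives a simple budget to compare against when a multi-agent pipeline spends several times as many model calls on the same question.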