🤖 AI Summary
This study addresses the lack of a shared unit for evaluating multi-agent large language model systems, where counting nominal agents conflates computational cost with independent evidence. The authors propose a two-parameter scaling law relating effective team size to nominal size, and use a mean-field argument to show that debate rounds and peers per agent enter the dynamics only through their product. The law is applied at two levels, answer diversity and correctness redundancy, and classifies each configuration into one of three asymptotic regimes; within the configurations tested, only heterogeneous teams escape the hard-ceiling regime. Across 44 experimental conditions (including Qwen, Llama, and Ministral models on multiple tasks), the functional form fits every condition with R² > 0.99. Key findings include that dense debate does not increase answer diversity on MMLU-Hard, that heterogeneous teams lower the scaling constant c, and that communication-mode interventions do not.
📝 Abstract
Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-β})$, where the regime exponent $β$ classifies any configuration into one of three asymptotic regimes: hard-ceiling, with $N_\text{eff} \to 1/c$ ($β = 0$); sublinear, with $N_\text{eff} \sim N^β/c$ ($0 < β < 1$); or linear ($β \ge 1$). A mean-field theorem predicts that peer count $k$ and rounds $τ$ during agent debate enter the dynamics only through their product $kτ$. The law applies at two levels: answer diversity and correctness redundancy.
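To make the three regimes concrete, the following minimal Python sketch evaluates the scaling law exactly as written above. The function names (`redundancy_ratio`, `effective_size`, `regime`) and the value $c = 0.5$ are illustrative assumptions, not taken from the paper or its results.

```python
def redundancy_ratio(n, c, beta):
    """R(N) = N_eff / N = 1 / (1 + c (N - 1) N^(-beta))."""
    if n < 1:
        raise ValueError("team size N must be at least 1")
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def effective_size(n, c, beta):
    """N_eff = N * R(N)."""
    return n * redundancy_ratio(n, c, beta)


def regime(beta):
    """Classify the regime exponent using the thresholds stated in the abstract."""
    if beta < 0:
        raise ValueError("the abstract only defines regimes for beta >= 0")
    if beta == 0:
        return "hard-ceiling"  # N_eff approaches 1/c
    if beta < 1:
        return "sublinear"     # N_eff grows like N^beta / c
    return "linear"            # N_eff grows in proportion to N


# Hypothetical constant chosen only for illustration.
C_EXAMPLE = 0.5

for beta in (0.0, 0.5, 1.0):
    sizes = [round(effective_size(n, C_EXAMPLE, beta), 2) for n in (1, 5, 30)]
    print(f"beta={beta} ({regime(beta)}): N_eff at N=1, 5, 30 -> {sizes}")
```

With $β = 0$ the effective size saturates near $1/c$ however many agents are added, whereas with $β = 1$ it keeps growing at a fixed fraction $1/(1+c)$ of the nominal size.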
Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, β)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear into hard-ceiling; correctness-level fits remain hard-ceiling throughout. Three findings have practical implications. \emph{(i)}~Thirty dense debating agents produce no more answer diversity than one on MMLU-Hard. \emph{(ii)}~A noise placebo tracks self-correction on free-form math and at $4\times$ scale, so within homogeneous teams the gain commonly attributed to ``debate'' comes from re-evaluation, not peer content. \emph{(iii)}~A single $N \le 5$ pilot predicts the $N=30$ structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime; communication-mode interventions do not.
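One way a small pilot could be extrapolated to $N = 30$ is sketched below. The abstract does not describe the fitting procedure, so this is an assumption: rearranging the law gives $\log\big((1/R - 1)/(N-1)\big) = \log c - β \log N$, which is linear in $\log N$ and can be fitted by ordinary least squares. The pilot values are synthetic and noise-free, generated from the law itself with hypothetical parameters; they are not measurements from the paper.

```python
import math


def fit_scaling_law(pilot):
    """Estimate (c, beta) from (N, R) pairs via the log-linearised law.

    Assumed procedure, not the paper's: regress
    log((1/R - 1) / (N - 1)) on log N; slope = -beta, intercept = log c.
    """
    xs, ys = [], []
    for n, r in pilot:
        if n < 2 or not (0.0 < r < 1.0):
            raise ValueError("need N >= 2 and 0 < R < 1 for the log transform")
        xs.append(math.log(n))
        ys.append(math.log((1.0 / r - 1.0) / (n - 1)))
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct team sizes")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope


def predict_n_eff(n, c, beta):
    """N_eff = N / (1 + c (N - 1) N^(-beta))."""
    return n / (1.0 + c * (n - 1) * n ** (-beta))


# Synthetic pilot for N = 2..5 with hypothetical true parameters.
C_TRUE, BETA_TRUE = 0.4, 0.3
pilot = [(n, predict_n_eff(n, C_TRUE, BETA_TRUE) / n) for n in range(2, 6)]

c_hat, beta_hat = fit_scaling_law(pilot)
print(f"fitted c={c_hat:.4f}, beta={beta_hat:.4f}")
print(f"extrapolated N_eff at N=30: {predict_n_eff(30, c_hat, beta_hat):.3f}")
```

Because the synthetic pilot is noise-free, the fit recovers the generating parameters; real measurements would carry sampling noise, so an extrapolation from four points would need uncertainty estimates before being relied on.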