Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

📅 2026-01-09

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

Traditional multi-agent debate (MAD) often fails to effectively enhance large language model performance due to homogeneous agents and uniform belief updating, sometimes even underperforming simple majority voting. This work proposes an improved framework that better mirrors human-like negotiation mechanisms: it employs diversity-aware initialization to increase the prior probability of correct hypotheses and introduces explicit confidence calibration in agent communication, coupled with a confidence-weighted belief update rule to systematically guide the debate toward the correct answer. Theoretical analysis and extensive experiments across six reasoning-based question-answering benchmarks demonstrate that the proposed approach significantly outperforms both conventional MAD and majority voting, thereby substantially improving the accuracy and reliability of multi-agent debates.
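To make the idea of diversity-aware initialization concrete, here is a minimal sketch. It assumes the debate starts from a pool of sampled candidate answers and that "diversity" simply means preferring distinct answers over repeats. The function name `select_diverse` and the greedy selection rule are illustrative assumptions, not the paper's actual procedure.

```python
from collections import Counter

def select_diverse(candidates, k):
    """Pick k starting answers, taking distinct answers before repeats.

    Hypothetical helper: a greedy stand-in for diversity-aware selection.
    """
    counts = Counter(candidates)
    # Distinct answers, most frequent first (ties keep first-seen order).
    distinct = [answer for answer, _ in counts.most_common()]
    chosen = distinct[:k]
    # Fewer than k distinct answers: pad with leftover samples in pool order.
    remaining = list(candidates)
    for answer in chosen:
        remaining.remove(answer)
    chosen += remaining[: k - len(chosen)]
    return chosen

pool = ["A", "A", "A", "B", "A", "C"]
naive = pool[:3]                   # first three samples: all "A"
diverse = select_diverse(pool, 3)  # one of each distinct answer
print(naive, diverse)
```

If the correct answer is "C", the naive selection never exposes it to the debate, while the diverse selection does. That is the sense in which diversity can raise the chance that a correct hypothesis is present at the start.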

📝 Abstract

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
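The contrast between uniform and confidence-conditioned aggregation can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: each agent reports one answer with a confidence in [0, 1], and answers are scored by summed confidence. The names `majority_vote` and `confidence_weighted_vote` are hypothetical, and this single-step vote is not the paper's debate protocol or update rule.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return the most common answer, treating every agent equally."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, confidences):
    """Return the answer with the largest total reported confidence."""
    scores = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)

# Two unsure agents agree on "B"; one well-calibrated agent is sure of "A".
answers = ["B", "B", "A"]
confidences = [0.3, 0.3, 0.9]

print(majority_vote(answers))                          # B
print(confidence_weighted_vote(answers, confidences))  # A
```

The weighted result only helps when reported confidence tracks correctness, which is why the abstract stresses that the confidence must be calibrated.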

Problem

Research questions and friction points this paper is trying to address.

multi-agent debate

large language models

collective decision-making

confidence calibration

agent diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent debate

diversity-aware initialization

confidence calibration