Multiagent Protocols with Aggregated Confidence Signals

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of any method that produces or evaluates a confidence for the collective output of a multiagent system. The authors propose three protocols that first transform individual agents' raw confidence signals so they are comparable across models, then combine them through soft voting or Bayesian fusion into a single aggregated confidence for the final answer. The study analyzes two confidence estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators. Experiments with six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, show that the aggregated confidence is substantially more discriminative (higher AUARC) than that of the best single agent or the standard debate baselines, while F1 stays stable. On more ambiguous tasks, the protocols recover the losses that multiagent debate (MAD) otherwise incurs.
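For readers unfamiliar with AUARC (area under the accuracy-rejection curve), the following is a minimal sketch of one common way to compute it, not necessarily the exact variant used in the paper. The function name `auarc` and its inputs are illustrative assumptions: `confidences` holds one confidence per answer and `correct` holds 1 for a correct answer and 0 otherwise.

```python
def auarc(confidences, correct):
    """Approximate the area under the accuracy-rejection curve.

    Assumption: answers are ranked by confidence, and the curve is sampled
    at every coverage level k/n (keep the k most confident answers, reject
    the rest). The result is the mean accuracy over those coverage levels.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    # Indices of answers from most to least confident.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)

    hits = 0      # correct answers among the k most confident
    total = 0.0   # running sum of accuracy at each coverage level
    for k, idx in enumerate(order, start=1):
        hits += correct[idx]
        total += hits / k
    return total / len(order)


# A confidence that ranks correct answers first scores higher than one
# that ranks them last, even though plain accuracy is identical (0.5).
well_ranked = auarc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
badly_ranked = auarc([0.1, 0.3, 0.8, 0.9], [1, 1, 0, 0])
```

Because the score depends only on how confidence orders the answers, it measures discrimination rather than calibration, which is consistent with the abstract's observation that AUARC is less reliant on calibration.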

📝 Abstract

Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
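The two combination rules named in the abstract can be illustrated with a small sketch. This is not the paper's formulation, which the abstract does not spell out. It assumes each agent reports one answer with a confidence that has already been transformed to be comparable across models, that the remaining probability mass is spread uniformly over the other candidates plus a hypothetical catch-all `OTHER`, and that "Bayesian fusion" means a normalized product of the agents' distributions under an independence assumption with a uniform prior. All names here (`aggregate`, `to_distribution`, `OTHER`, `EPS`) are illustrative.

```python
from math import prod

OTHER = "<other>"  # hypothetical catch-all for answers no agent proposed
EPS = 1e-6         # keeps confidences away from 0 and 1 so products stay finite


def to_distribution(answer, confidence, candidates):
    """Turn one (answer, confidence) report into a distribution over candidates."""
    confidence = min(max(confidence, EPS), 1.0 - EPS)
    rest = [c for c in candidates if c != answer]
    share = (1.0 - confidence) / len(rest)  # assumption: uniform leftover mass
    dist = {c: share for c in rest}
    dist[answer] = confidence
    return dist


def aggregate(reports, method="soft"):
    """Return (final_answer, aggregated_confidence) for a list of agent reports."""
    candidates = sorted({answer for answer, _ in reports}) + [OTHER]
    dists = [to_distribution(a, c, candidates) for a, c in reports]

    if method == "soft":
        # Soft voting: average the agents' probabilities per candidate.
        scores = {c: sum(d[c] for d in dists) / len(dists) for c in candidates}
    elif method == "bayes":
        # Product of probabilities, renormalized (independent agents, uniform prior).
        raw = {c: prod(d[c] for d in dists) for c in candidates}
        z = sum(raw.values())
        scores = {c: v / z for c, v in raw.items()}
    else:
        raise ValueError(f"unknown method: {method}")

    # The catch-all is never returned as the final answer.
    best = max((c for c in candidates if c != OTHER), key=scores.get)
    return best, scores[best]


agree = [("Paris", 0.9), ("Paris", 0.6)]
soft_answer, soft_conf = aggregate(agree, "soft")      # averages to 0.75
bayes_answer, bayes_conf = aggregate(agree, "bayes")   # agreement pushes it above 0.9
```

Under these assumptions, soft voting stays between the individual confidences, while the product rule rewards agreement and penalizes disagreement more sharply. That difference is one reason to compare the two rules empirically.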

Problem

Research questions and friction points this paper is trying to address.

multiagent systems

confidence aggregation

natural language processing

system confidence

multiagent debate

Innovation

Methods, ideas, or system contributions that make the work stand out.

aggregated confidence

multiagent debate

Bayesian fusion