The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

📅 2026-06-08

🤖 AI Summary

Evaluations of multi-agent debate systems usually check only whether the final answer is correct and ignore the quality of the intermediate reasoning. This work pairs a two-agent debate framework, a Constructor and an Auditor, with an LLM-as-judge that scores each agent's reasoning on instruction following, justification quality, and evidence grounding, and it uses token-level log-probabilities to trace each agent's internal confidence. In the rubric-scoring domain, the study finds a role asymmetry: the Constructor's confidence aligns with judged reasoning quality roughly twice as strongly as the Auditor's. Confidence-based detection of critical reasoning failures is likewise more reliable for the Constructor (AUROC 0.804) than for the Auditor (AUROC 0.634).

📝 Abstract

Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.
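To make the confidence-based failure detection described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that confidence is summarized as the mean token log-probability of a turn and that the judge's critical-failure flag is a boolean per turn. The function names (`mean_logprob`, `auroc`) and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average token log-probability of one turn (assumed confidence proxy)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as half a win (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: (token log-probs for one turn, judge's failure flag).
turns = [
    ([-0.1, -0.2, -0.1], False),
    ([-1.5, -2.0, -1.0], True),
    ([-0.3, -0.4, -0.2], False),
    ([-0.2, -0.3, -0.1], True),
]

confidences = [mean_logprob(logprobs) for logprobs, _ in turns]
failure_flags = [flag for _, flag in turns]

# Low confidence is treated as evidence of failure, so the detector score
# is the negated confidence.
failure_auroc = auroc([-c for c in confidences], failure_flags)
print(f"failure-detection AUROC: {failure_auroc:.3f}")
```

Computing this separately for the Constructor's turns and the Auditor's turns would give one AUROC per role, which is the kind of comparison the abstract reports (0.804 versus 0.634).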

Problem

Research questions and friction points this paper is trying to address.

multi-agent debate

reasoning quality

log-probability

LLM-as-judge

confidence signals

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent debate

log-probabilities

LLM-as-Judge