๐ค AI Summary
This study addresses a critical yet overlooked issue in multi-agent large language models for medical question answering: the conflation of answer consistency with reliability, which often masks underlying inconsistencies in reasoning processes. The work identifies and characterizes a novel phenomenon termed โconsistency hallucination,โ wherein deliberation among agents increases consensus on final answers but simultaneously degrades semantic alignment across their reasoning chains. To mitigate this, the authors introduce the CARA metric to quantify reasoning alignment and propose the Grounded Debate Protocol (GDP), which enforces factual grounding and explicit stance-taking to enhance reasoning consistency. Experimental results demonstrate that GDP significantly improves reasoning alignment on MedQA-USMLE and MedThink-Bench, yielding large effect sizes (Cohenโs d = +1.43 to +1.99) without requiring additional model calls or architectural modifications.
๐ Abstract
Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reasoning Alignment), a family of automated metrics that measure whether agents who agree on an answer also agree on the reasoning. Applying CARA to a standard debate system on two medical QA benchmarks, MedQA-USMLE and MedThink-Bench, we identify the consistency illusion: a failure mode where debate reduces detectable contradictions between agents while simultaneously decreasing the semantic similarity of their reasoning chains; agents appear to agree more but reason less consistently. To improve this misalignment, we propose the Grounded Debate Protocol (GDP), a prompt-level intervention that requires agents to commit to named medical facts and take explicit stances on other agents' claims. GDP produces large, consistent alignment improvements, with Cohen's d ranging from +1.43 to +1.99, across two datasets and two backbone models, without adding LLM calls or modifying system architecture. Our results motivate cross-agent reasoning alignment as a quantity to audit alongside accuracy in safety-critical domains.