Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate

📅 2026-05-30

🤖 AI Summary
This study addresses the challenge of distinguishing genuine reasoning from social conformity in multi-agent large language model (LLM) debate, where convergence on a shared answer obscures what drove it. The authors propose a three-source decomposition framework that separates answer flips into three drivers: spontaneous instability, stance-induced conformity, and reasoning-induced persuasion. Using controlled counterfactual conditions, an information-gradient experiment, self-reflection controls, and a predictive model built on Round 0 features, the study finds that in the primary MMLU-Pro setting 37% of agent-question observations change under self-reflection alone, while strict conformity is 29% and is predominantly harmful (57-77% of cases are correct-to-wrong across model replications). The predictive model reaches an AUC of 0.79, and risk-targeted intervention reduces harmful conformity by 13.6 percentage points (p < 0.001). Without correctness labels or self-reflection controls, however, reducing peer adoption does not improve accuracy, because harmful and beneficial influence cannot be distinguished.
📝 Abstract
Multi-agent debate (MAD) is a promising strategy for improving LLM reasoning, but when agents converge on a shared answer, it is unclear whether that convergence reflects genuine deliberation or social compliance. We show that the conventional answer flip rate conflates three distinct mechanisms: spontaneous instability, stance-induced conformity, and reasoning-induced persuasion. Our three-source decomposition framework isolates each through controlled counterfactual conditions. In the primary MMLU-Pro setting, 37% of agent-question observations change under self-reflection alone, while robustness tests show substantial model-dependent instability across GPQA-Diamond and three model families; strict conformity is 29% in the primary setting and remains predominantly harmful across model replications (57-77% correct-to-wrong). A controlled information-gradient experiment reveals that even vacuous reasoning is associated with 20-39% error adoption among resistant agents, with reasoning-like presentation carrying substantial persuasive weight. Harmful conformity can be predicted from Round 0 features (AUC = 0.79), and risk-targeted intervention reduces it by 13.6 percentage points (p < 0.001). However, without correctness labels or self-reflection controls, reducing peer adoption does not improve accuracy, because harmful and beneficial influence cannot be distinguished.
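To make the decomposition idea concrete, here is a minimal sketch of how a raw flip rate might be split across three counterfactual conditions. It is an illustration under stated assumptions, not the authors' estimator: the abstract does not give the formulas, so the subtractive attribution, the `Observation` fields, and the function names `decompose` and `harmful_share` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Observation:
    """One agent-question pair (hypothetical schema, not from the paper)."""
    initial_correct: bool            # Round 0 answer was correct
    flipped_self_reflection: bool    # answer changed with no peer input
    flipped_stance_only: bool        # answer changed after seeing peer answers only
    flipped_with_reasoning: bool     # answer changed after seeing answers plus reasoning
    final_correct_stance_only: bool  # correctness after the stance-only condition


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def decompose(observations: list[Observation]) -> dict[str, float]:
    """Attribute flips to three sources by differencing condition-level rates.

    Assumption: each richer condition adds influence on top of the previous
    one, so the increment is credited to the newly added information.
    """
    spontaneous = rate(o.flipped_self_reflection for o in observations)
    stance_total = rate(o.flipped_stance_only for o in observations)
    reasoning_total = rate(o.flipped_with_reasoning for o in observations)
    return {
        "spontaneous_instability": spontaneous,
        "stance_induced_conformity": max(0.0, stance_total - spontaneous),
        "reasoning_induced_persuasion": max(0.0, reasoning_total - stance_total),
    }


def harmful_share(observations: list[Observation]) -> float:
    """Among stance-only flips, the fraction that went from correct to wrong."""
    flips = [o for o in observations if o.flipped_stance_only]
    return rate(o.initial_correct and not o.final_correct_stance_only for o in flips)
```

The point of the sketch is the one the abstract makes: a single flip rate measured under full debate mixes all three terms, so without the self-reflection baseline the spontaneous component would be miscounted as peer influence.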
Problem

Research questions and friction points this paper is trying to address.

multi-agent debate
stance convergence
social conformity
LLM reasoning
answer flip
Innovation

Methods, ideas, or system contributions that make the work stand out.

stance convergence decomposition
multi-agent LLM debate
reasoning-induced persuasion
harmful conformity
counterfactual analysis