🤖 AI Summary
This work addresses two weaknesses of fixed-round multi-agent debate: wasted computation and the amplification of correlated errors among similar agents. The authors propose ARMOR-MAD, a training-free, heterogeneous multi-agent debate framework that treats deliberation as conditional computation and regulates the debate through three mechanisms. Pre-debate Agreement Routing (PAR) decides whether the independently generated Round-0 answers need debate at all, Early Agreement Stopping Evaluator (EASE) ends the debate once the agents converge, and Semantic Outlier Detection (SOD) down-weights abnormal final answers during aggregation. The method reaches 65.5% accuracy on MATH Level 5, 96.5% on GSM8K, 90.0% on MMLU, and 81.5% on MMLU-Pro, consistently improving over fixed-round heterogeneous debate that uses the same model pool.
📝 Abstract
Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional computation. ARMOR-MAD combines three components: Pre-debate Agreement Routing (PAR) decides whether independently generated Round-0 answers require debate; Early Agreement Stopping Evaluator (EASE) stops debate after convergence; and Semantic Outlier Detection (SOD) down-weights abnormal final answers during aggregation. Across MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD consistently improves over fixed-round heterogeneous debate with the same model pool, reaching 65.5%, 96.5%, 90.0%, and 81.5% accuracy, respectively. The results suggest that genuine model heterogeneity and agreement-based control are both important for making MAD more accurate and efficient.
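To make the control flow concrete, the following is a minimal Python sketch of how the three components could fit together. It is not the authors' implementation. The `Agent` signature, the thresholds, `max_rounds`, and `outlier_weight` are all hypothetical. Agreement is measured by exact string match, and the SOD step is replaced by a simple stand-in that down-weights answers held by a single agent, whereas the abstract describes a semantic outlier detector whose details are not given here.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: an agent maps (question, peer_answers) to an answer.
Agent = Callable[[str, Sequence[str]], str]


def agreement(answers: Sequence[str]) -> float:
    """Fraction of answers equal to the most common answer (exact match)."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def weighted_vote(answers: Sequence[str], outlier_weight: float = 0.5) -> str:
    """Stand-in for SOD: answers held by only one agent get a reduced weight.

    Assumption: a real detector would compare answers semantically.
    """
    counts = Counter(answers)
    scores = {}
    for answer in answers:
        is_outlier = counts[answer] == 1 and len(counts) > 1
        weight = outlier_weight if is_outlier else 1.0
        scores[answer] = scores.get(answer, 0.0) + weight
    return max(scores, key=scores.get)


def run_debate(
    question: str,
    agents: List[Agent],
    max_rounds: int = 3,
    route_threshold: float = 1.0,
    stop_threshold: float = 1.0,
) -> Tuple[str, int]:
    """Return (final_answer, debate_rounds_used)."""
    # Round 0: every agent answers independently, without seeing peers.
    answers = [agent(question, []) for agent in agents]
    rounds_used = 0

    # PAR-style routing: debate only if Round-0 agreement is too low.
    if agreement(answers) < route_threshold:
        for _ in range(max_rounds):
            # Each agent revises after seeing the previous round's answers.
            answers = [agent(question, answers) for agent in agents]
            rounds_used += 1
            # EASE-style stopping: end the debate once agents converge.
            if agreement(answers) >= stop_threshold:
                break

    # SOD-style aggregation: down-weight outliers, then vote.
    return weighted_vote(answers), rounds_used
```

In this sketch, a question on which all agents already agree at Round 0 costs no debate rounds, and a debate that converges after one round stops there instead of running to `max_rounds`. This is the sense in which debate becomes conditional computation.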