🤖 AI Summary
This work addresses the instability of multi-agent large language model (LLM) systems, which often fail to reliably outperform a single strong model with best-of-N sampling. The authors attribute this to ill-posed equilibrium selection: current systems specify what information agents share, but not which coordination convention should be selected. They formulate the problem as a discounted Markov game with incomplete information and propose the Heterogeneous Quantal Response Equilibrium (HQRE), an entropy-regularized equilibrium with agent- and state-dependent temperatures, as a well-posed coordination target. Under a monotonicity condition, HQRE is unique, admits linearly convergent mirror updates, and yields bounded Bayesian regret; the same condition provides stability diagnostics that can be measured from rollouts. Building on HQRE, they introduce two algorithms: DICE-PC, which coordinates frozen models through prompt-control actions, and DICE-FT, which performs parameter-efficient mirror fine-tuning. Evaluated on 11 benchmarks in four domains, DICE improves accuracy-cost trade-offs over strong within-class baselines, with average gains on reasoning and planning tasks of 4.3 percentage points for DICE-PC and 8.5 percentage points for DICE-FT.
📝 Abstract
Multi-agent large language model (LLM) systems often fail to reliably outperform a single strong model equipped with best-of-N sampling. We argue that a core source of this instability is ill-posed equilibrium selection: current systems specify what information agents share, but not which coordination convention should be selected. We formalize a broad class of such systems as discounted incomplete-information Markov games and show that two common pathologies, oscillation between competing conventions and drift across them, can both induce unstable learning and linear Bayesian regret. To obtain a well-posed target, we introduce the Heterogeneous Quantal Response Equilibrium (HQRE), an entropy-regularized equilibrium concept with agent- and state-dependent temperatures. Under a monotonicity condition, HQRE is unique, admits linearly convergent mirror updates, and yields bounded Bayesian regret; the same condition yields rollout-measurable stability diagnostics. We instantiate this objective in two algorithms: DICE-PC, which coordinates frozen models through prompt-control actions, and DICE-FT, which performs parameter-efficient mirror fine-tuning. Across eleven benchmarks in four domains, DICE improves accuracy-cost trade-offs over strong within-class baselines; on reasoning and planning tasks, DICE-PC improves by 4.3 percentage points on average and DICE-FT by 8.5 points.
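To make the idea of an entropy-regularized equilibrium with agent-specific temperatures concrete, here is a minimal sketch on a toy two-player zero-sum matrix game. It is not the authors' DICE algorithm and does not model states, incomplete information, or LLM agents. The payoff matrix `A`, the temperatures `taus`, the step size `eta`, and all function names are illustrative assumptions. The update is a standard entropy-regularized mirror step whose fixed point is the logit quantal response condition, where each player's policy is proportional to `exp(q / tau)`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def payoffs(A, pi_row, pi_col):
    """Per-action expected payoffs for both players of the zero-sum game A.

    The row player receives A[i][j]; the column player receives -A[i][j].
    """
    n, m = len(A), len(A[0])
    q_row = [sum(A[i][j] * pi_col[j] for j in range(m)) for i in range(n)]
    q_col = [-sum(A[i][j] * pi_row[i] for i in range(n)) for j in range(m)]
    return q_row, q_col


def mirror_step(pi, q, tau, eta):
    """One entropy-regularized mirror step:
    pi_new is proportional to (pi * exp(eta * q)) ** (1 / (1 + eta * tau)).
    Its fixed point satisfies pi proportional to exp(q / tau).
    """
    logits = [(math.log(p) + eta * qa) / (1.0 + eta * tau)
              for p, qa in zip(pi, q)]
    return softmax(logits)


def solve_qre(A, taus, eta=0.05, iters=5000):
    """Simultaneous mirror updates with one temperature per player (assumed setup)."""
    pi_row = [1.0 / len(A)] * len(A)
    pi_col = [1.0 / len(A[0])] * len(A[0])
    for _ in range(iters):
        q_row, q_col = payoffs(A, pi_row, pi_col)
        pi_row, pi_col = (mirror_step(pi_row, q_row, taus[0], eta),
                          mirror_step(pi_col, q_col, taus[1], eta))
    return pi_row, pi_col


def qre_residual(A, taus, pi_row, pi_col):
    """Largest gap between each policy and its own quantal (softmax) response."""
    q_row, q_col = payoffs(A, pi_row, pi_col)
    br_row = softmax([q / taus[0] for q in q_row])
    br_col = softmax([q / taus[1] for q in q_col])
    return max(abs(a - b) for a, b in zip(pi_row + pi_col, br_row + br_col))


if __name__ == "__main__":
    A = [[2.0, -1.0], [-1.0, 1.0]]  # hypothetical payoff matrix
    taus = (0.5, 1.0)               # hypothetical per-agent temperatures
    pi_row, pi_col = solve_qre(A, taus)
    print(pi_row, pi_col, qre_residual(A, taus, pi_row, pi_col))
```

A zero-sum game is used here because it is a simple case in which the game operator is monotone, which loosely mirrors the monotonicity condition mentioned in the abstract. The residual returned by `qre_residual` is one simple way to check how far a pair of policies is from the quantal response fixed point; whether it corresponds to the paper's stability diagnostics is not something this sketch assumes.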