Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

πŸ“… 2025-11-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In CTDE frameworks, the centralized-decentralized mismatch (CDM) arises when centralized critics and decentralized actors are misaligned, so that one agent's suboptimal behavior degrades the learning of others. Existing value-decomposition methods face a trade-off: linear decompositions enable decentralized gradient computation but lack representational capacity, while nonlinear decompositions offer stronger expressivity yet require centralized gradients, reintroducing CDM. This paper proposes monotonic nonlinear critic decomposition (NCD) combined with the multi-agent cross-entropy method (MCEM), enabling fully decentralized policy updates while significantly enhancing the representation of the joint value function. It further integrates a modified *k*-step return with Retrace off-policy correction to improve training stability and sample efficiency. Empirical evaluation shows that the approach consistently outperforms state-of-the-art methods across both continuous- and discrete-action benchmarks.

πŸ“ Abstract
Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others' learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks.
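The policy update the abstract describes — raising the probability of high-value joint actions rather than following centralized gradients — is the core idea of the cross-entropy method. The snippet below is a minimal single-agent-style CEM loop over a toy quadratic critic; the Gaussian sampler, the `value_fn` stand-in, and all hyperparameters are illustrative assumptions, not the paper's MCEM implementation.

```python
import numpy as np

def cem_action(value_fn, dim, iters=10, pop=64, elite_frac=0.1, seed=0):
    """Cross-entropy method: repeatedly fit a Gaussian to the
    top-scoring action samples, concentrating probability mass
    on high-value actions and discarding low-value ones."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))  # sample candidate actions
        scores = value_fn(samples)                        # score under the critic
        elite = samples[np.argsort(scores)[-n_elite:]]    # keep the best fraction
        mu = elite.mean(axis=0)                           # refit the sampling
        sigma = elite.std(axis=0) + 1e-6                  # distribution to elites
    return mu

# Toy critic: value peaks at action (0.5, -0.3)
target = np.array([0.5, -0.3])
best = cem_action(lambda a: -np.sum((a - target) ** 2, axis=1), dim=2)
```

Because the update only needs samples scored by the critic, not gradients through it, it sidesteps the centralized-gradient requirement that reintroduces CDM in nonlinear decompositions.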
Problem

Research questions and friction points this paper is trying to address.

Addresses centralized-decentralized mismatch in multi-agent reinforcement learning
Overcomes trade-off between gradient flexibility and representation expressiveness
Improves sample efficiency with modified returns and off-policy learning
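The sample-efficiency point above refers to the modified k-step return with Retrace correction. A minimal single-trajectory sketch of the standard Retrace(λ) recursion is shown below, assuming precomputed importance ratios and expected next-state values; the function name, argument shapes, and the "modification" being omitted are all assumptions, not the paper's exact formulation.

```python
import numpy as np

def retrace_targets(q, exp_q_next, rewards, rho, gamma=0.99, lam=1.0):
    """Retrace(lambda) targets: truncated importance weights
    c_t = lam * min(1, rho_t) keep off-policy corrections stable.

    q          : Q(s_t, a_t) along the trajectory, shape (T,)
    exp_q_next : E_pi[Q(s_{t+1}, .)], shape (T,)
    rewards    : r_t, shape (T,)
    rho        : ratios pi(a_t|s_t) / mu(a_t|s_t), shape (T,)
    """
    T = len(rewards)
    c = lam * np.minimum(1.0, rho)               # truncated traces
    delta = rewards + gamma * exp_q_next - q     # TD errors under pi
    targets = np.empty(T)
    acc = 0.0
    for t in reversed(range(T)):                 # backward recursion
        acc = delta[t] + (gamma * c[t + 1] * acc if t + 1 < T else 0.0)
        targets[t] = q[t] + acc
    return targets

# Sanity check: with rho = 1 and Q = 0 this reduces to the discounted return
targets = retrace_targets(q=np.zeros(3), exp_q_next=np.zeros(3),
                          rewards=np.array([1.0, 1.0, 1.0]),
                          rho=np.ones(3), gamma=0.5)
```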
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent cross-entropy method updates joint policies
Monotonic nonlinear critic decomposition enables decentralized gradients
Off-policy learning with modified returns improves sample efficiency
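The monotonic nonlinear decomposition in the second bullet can be pictured as a QMIX-style mixing network whose weights are constrained to be non-negative, so the joint value is nondecreasing in every per-agent value. The tiny numpy MLP below is an illustrative stand-in (random weights, ReLU, `|.|` constraint), not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Monotonic nonlinear mixing: per-agent values pass through a small
    MLP whose weights are made non-negative via abs(), guaranteeing
    dQ_tot/dQ_i >= 0 — each agent improves Q_tot by improving its own Q_i."""
    h = np.maximum(0.0, agent_qs @ np.abs(w1) + b1)  # ReLU hidden layer
    return h @ np.abs(w2) + b2                        # scalar Q_tot

n_agents, hidden = 3, 8
w1 = rng.normal(size=(n_agents, hidden))
b1 = rng.normal(size=hidden)
w2 = rng.normal(size=(hidden, 1))
b2 = rng.normal(size=1)

q = rng.normal(size=n_agents)
q_tot_before = monotonic_mix(q, w1, b1, w2, b2)
q[0] += 1.0                                           # raise one agent's value
q_tot_after = monotonic_mix(q, w1, b1, w2, b2)        # Q_tot cannot decrease
```

The monotonicity constraint is what lets each agent treat its own value as a local improvement signal, which is how the decomposition supports decentralized updates despite being nonlinear.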
πŸ”Ž Similar Papers
No similar papers found.
Yan Wang
School of Computing Technologies, RMIT University
Ke Deng
School of Computing Technologies, RMIT University
Yongli Ren
School of Computing Technologies, RMIT University
Recommender System · Quantum Machine Learning · Data Science