Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning

📅 2026-04-09
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the trade-off among distribution shift, policy efficiency, and robustness in offline multi-agent reinforcement learning by proposing VGM$^2$P, a framework that integrates global advantage-guided learning with a classifier-free-guided MeanFlow generative model. The approach casts optimal policy learning as conditional behavior cloning, enabling single-step action generation without iterative sampling while remaining insensitive to the behavior regularization coefficient. Trained via conditional behavior cloning alone, the method matches state-of-the-art performance on both discrete and continuous action-space tasks while substantially improving sample efficiency and policy stability.
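Because MeanFlow predicts an average velocity over an interval rather than an instantaneous one, the "single-step action generation" claimed above reduces to one network evaluation per guidance branch. Below is a minimal sketch (not the authors' code) of guided one-step sampling; the network `u_net`, its `(z, r, t, obs, cond)` signature, the zero tensor as null condition, and the guidance scale `w` are all illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_action(u_net, obs, cond, act_dim, w=2.0):
    """Generate a joint action with one guided MeanFlow step.

    u_net predicts the *average* velocity u(z_t, r, t), so integrating
    the flow from t = 1 (noise) back to r = 0 (data) needs no iteration:
        a = z_1 - (1 - 0) * u(z_1, r=0, t=1 | obs, cond).
    """
    n = obs.shape[0]
    z = torch.randn(n, act_dim)                        # z_1 ~ N(0, I)
    r = torch.zeros(n, 1)                              # target time r = 0
    t = torch.ones(n, 1)                               # start time  t = 1
    u_c = u_net(z, r, t, obs, cond)                    # value-conditioned branch
    u_u = u_net(z, r, t, obs, torch.zeros_like(cond))  # null-condition branch
    u = u_u + w * (u_c - u_u)                          # classifier-free guidance
    return z - (t - r) * u                             # single denoising step
```

In this sketch the two branches can be batched into one forward pass; the point is only that inference cost is constant in the number of flow steps, unlike multi-step diffusion samplers.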
📝 Abstract
Offline multi-agent reinforcement learning (MARL) aims to learn the optimal joint policy from pre-collected datasets, which requires trading off maximizing global returns against mitigating distribution shift from the offline data. Recent studies use diffusion or flow generative models to capture complex joint policy behaviors among agents; however, they typically rely on multi-step iterative sampling, which reduces training and inference efficiency. Although follow-up work improves sampling efficiency through techniques such as distillation, the resulting policies remain sensitive to the behavior regularization coefficient. To address these issues, we propose Value Guidance Multi-agent MeanFlow Policy (VGM$^2$P), a simple yet effective flow-based policy learning framework that enables efficient action generation with coefficient-insensitive conditional behavior cloning. Specifically, VGM$^2$P uses global advantage values to guide agent collaboration, treating optimal policy learning as conditional behavior cloning. Additionally, to improve policy expressiveness and inference efficiency in multi-agent settings, it leverages classifier-free guidance MeanFlow for both policy training and execution. Experiments on tasks with both discrete and continuous action spaces demonstrate that, even when trained solely via conditional behavior cloning, VGM$^2$P efficiently achieves performance comparable to state-of-the-art methods.
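To make the advantage-conditioned behavior-cloning objective concrete, here is a hedged sketch of one MeanFlow training step in PyTorch. The MeanFlow identity $u(z_t, r, t) = v_t - (t - r)\,\frac{d}{dt}u(z_t, r, t)$ turns behavior cloning into a regression whose target is computed with a Jacobian-vector product. Conditioning on a per-sample global advantage `adv` and dropping it with probability `p_uncond` (for classifier-free guidance) are assumptions about how VGM$^2$P wires the conditioning, not details stated in the abstract.

```python
import torch

def meanflow_bc_loss(u_net, obs, act, adv, p_uncond=0.1):
    """Advantage-conditioned behavior-cloning loss for a MeanFlow policy.

    Linear-interpolation path: z_t = (1 - t) * a + t * eps, with
    instantaneous velocity v = eps - a. The MeanFlow identity
        u(z_t, r, t) = v - (t - r) * d/dt u(z_t, r, t)
    gives a regression target; the target is stop-gradient'ed.
    adv is a (B, 1) global advantage used as the guidance condition.
    """
    B = act.shape[0]
    eps = torch.randn_like(act)
    t = torch.rand(B, 1)
    r = torch.rand(B, 1) * t                        # enforce 0 <= r <= t
    z = (1 - t) * act + t * eps
    v = eps - act

    # Randomly drop the condition so classifier-free guidance works at test time.
    cond = torch.where(torch.rand(B, 1) < p_uncond, torch.zeros_like(adv), adv)

    # d/dt u along the flow: tangent (dz/dt, dr/dt, dt/dt) = (v, 0, 1).
    u, dudt = torch.func.jvp(
        lambda z_, r_, t_: u_net(z_, r_, t_, obs, cond),
        (z, r, t),
        (v, torch.zeros_like(r), torch.ones_like(t)),
    )
    target = (v - (t - r) * dudt).detach()          # stop-gradient target
    return ((u - target) ** 2).mean()
```

Note that this objective is pure (weighted) regression: there is no critic in the policy loss and no explicit regularization coefficient to tune, which is consistent with the coefficient-insensitivity the abstract claims.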
Problem

Research questions and friction points this paper is trying to address.

offline multi-agent reinforcement learning
distribution shift
sampling efficiency
behavior regularization
joint policy learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

offline multi-agent reinforcement learning
flow-based generative model
conditional behavior cloning
classifier-free guidance
value-guided policy
👥 Authors
Teng Pang
School of Software, Shandong University, Jinan, China
Zhiqiang Dong
School of Software, Shandong University, Jinan, China
Yan Zhang
School of Software, Shandong University, Jinan, China
Rongjian Xu
School of Software, Shandong University, Jinan, China
Guoqiang Wu
Associate Professor, Shandong University
Machine Learning · Learning Theory · Reinforcement Learning
Yilong Yin
School of Software, Shandong University, Jinan, China