AI Summary
This work addresses the deployment of cooperative multi-agent reinforcement learning systems when team size varies across episodes, agents cannot communicate at execution time, and online fine-tuning is unavailable. The authors propose PC3D, a method that trains decentralized policies through personalized context distillation. During training, a set-structured centralized teacher compresses the active team into coordination tokens, personalizes them into agent-specific contexts, and distills these into the individual agent policies. At execution, each agent predicts its own context from its local history and adaptively uses it to condition its decisions, supporting zero-shot generalization across team sizes. Across three cooperative multi-agent reinforcement learning benchmarks, the method achieves higher returns than the evaluated baselines for both seen and unseen team sizes.
Abstract
Cooperative multi-agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision-making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.
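The teacher-student pipeline described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the attention-based pooling, the sigmoid gate, the mean-squared-error distillation loss, and every name and size below (`teacher_contexts`, `student_context`, `D_OBS`, `D_CTX`, `K`) are assumptions made for illustration, and each agent's local history is reduced to a single feature vector. The sketch only shows the shape of the idea: a set-structured teacher that accepts any team size, and a student that predicts its own context from local information alone.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: local feature width, context width, number of coordination tokens.
D_OBS, D_CTX, K = 6, 4, 3

# Hypothetical teacher parameters (training time only, sees the whole team).
W_embed = rng.normal(size=(D_OBS, D_CTX))
token_queries = rng.normal(size=(K, D_CTX))
# Hypothetical student parameters (execution time, sees one agent's local features).
W_student = rng.normal(size=(D_OBS, D_CTX))
w_gate = rng.normal(size=D_CTX)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def teacher_contexts(team_feats):
    """Map (n_agents, D_OBS) to (n_agents, D_CTX) for any n_agents."""
    h = np.tanh(team_feats @ W_embed)  # per-agent embeddings, (n, D_CTX)
    # Compress: K queries attend over the set of agents, giving K coordination tokens.
    pool = softmax(token_queries @ h.T / np.sqrt(D_CTX), axis=-1)  # (K, n)
    tokens = pool @ h  # (K, D_CTX), invariant to agent order
    # Personalize: each agent attends over the tokens to get its own context.
    mix = softmax(h @ tokens.T / np.sqrt(D_CTX), axis=-1)  # (n, K)
    return mix @ tokens  # (n, D_CTX)


def student_context(local_feats):
    """Predict one agent's context and a usage gate from its local features only."""
    ctx = np.tanh(local_feats @ W_student)
    # Assumed form of adaptive use: a scalar in (0, 1) scaling reliance on the context.
    gate = 1.0 / (1.0 + np.exp(-(ctx @ w_gate)))
    return ctx, gate


def distillation_loss(team_feats):
    """Mean squared error between student predictions and teacher contexts."""
    target = teacher_contexts(team_feats)
    pred = np.stack([student_context(f)[0] for f in team_feats])
    return float(np.mean((pred - target) ** 2))


if __name__ == "__main__":
    for n in (2, 5, 9):  # varying roster sizes, same parameters
        feats = rng.normal(size=(n, D_OBS))
        print(n, teacher_contexts(feats).shape, distillation_loss(feats))
```

Because the tokens are formed by attention over the set of agents, the same parameters apply to any roster size, and reordering the agents only reorders the per-agent contexts. A downstream policy could condition on `gate * ctx`, which is one plausible reading of "adaptive context use" and should be treated as an assumption.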