🤖 AI Summary
In heterogeneous multi-task learning, significant disparities in observation/action spaces and task difficulty cause gradient conflicts and model plasticity degradation. To address this, we propose ScaleZero—a Mixture-of-Experts (MoE)-based multi-task world model. Its core innovation is a LoRA-driven dynamic parameter expansion mechanism that enables on-demand expert activation and low-rank incremental updates, balancing computational efficiency with continual knowledge retention. ScaleZero integrates online reinforcement learning, dynamic parameter scaling, and task-adaptive expert routing. Evaluated on Atari, DeepMind Control Suite (DMControl), and Jericho benchmarks, it matches or exceeds the performance of task-specialized single-task models. Remarkably, it achieves competitive performance using only 80% of the environment interaction steps required by baseline methods, demonstrating substantial improvements in both sample and computational efficiency.
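The summary mentions task-adaptive expert routing inside a Mixture-of-Experts world model. As a rough illustration of the general mechanism (a generic top-k token-routing MoE layer, not ScaleZero's exact architecture; all names here are hypothetical), a sketch might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k MoE feed-forward layer: each token is routed to its
    k highest-scoring experts, mitigating gradient conflicts by letting
    different tasks activate different parameter subsets.
    Illustrative sketch only, not ScaleZero's implementation."""

    def __init__(self, dim: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)  # router producing expert scores
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        scores = self.gate(x)
        top_w, top_idx = torch.topk(scores, self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)  # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    # Weighted contribution of expert e for tokens routed to it.
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts run for each token, so per-token compute stays roughly constant as the expert pool grows.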
📝 Abstract
In heterogeneous multi-task learning, tasks not only exhibit diverse observation and action spaces but also vary substantially in intrinsic difficulty. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling large-scale heterogeneous environments, gradient conflicts and the loss of model plasticity often constrain their sample and computational efficiency. In this work, we address these challenges from two perspectives: the single learning iteration and the overall learning process. First, we investigate the impact of key design spaces on extending UniZero to multi-task planning. We find that a Mixture-of-Experts (MoE) architecture provides the most substantial performance gains by mitigating gradient conflicts, leading to our proposed model, *ScaleZero*. Second, to dynamically balance the computational load across the learning process, we introduce an online, LoRA-based *dynamic parameter scaling* (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Empirical evaluations on standard benchmarks such as Atari, DMControl (DMC), and Jericho demonstrate that ScaleZero, relying exclusively on online reinforcement learning with one model, attains performance on par with specialized single-task baselines. Furthermore, when augmented with our dynamic parameter scaling strategy, our method achieves competitive performance while requiring only 80% of the single-task environment interaction steps. These findings underscore the potential of ScaleZero for effective large-scale multi-task learning. Our code is available at https://github.com/opendilab/LightZero.
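The abstract's DPS strategy rests on LoRA-style low-rank incremental updates: the base weights are frozen to retain prior knowledge, while a small trainable low-rank correction is added on demand. A minimal sketch of such an adapter (standard LoRA mechanics under assumed rank/scaling hyperparameters, not the paper's implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update W + (alpha/r) * B @ A.
    Hypothetical sketch of the LoRA mechanism; rank and alpha are illustrative."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze base weights: knowledge retention
        in_f, out_f = base.in_features, base.out_features
        # Low-rank factors: only rank * (in_f + out_f) new trainable parameters.
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero-init => no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Because `B` is zero-initialized, attaching an adapter leaves the model's behavior unchanged until training moves the new parameters, which is what makes progressive, on-demand expansion safe.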