GPU-Parallel Multi-Task Reinforcement Learning with Demonstration Guided Policy Optimization

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing GPU-parallel reinforcement learning systems struggle to train multiple tasks jointly within families of structured manipulation tasks, particularly under sparse rewards and limited demonstration data. To address this, the work introduces MT-Libero, a multi-task RL benchmark supporting parallel rendering, physics randomization, and state or visual inputs, alongside a demonstration-guided policy optimization method (DGPO). DGPO combines importance-weighted PPO with adaptive behavior cloning and exposes a tunable preference toward the demonstrated task distribution, enabling stable online multi-task training under sparse success signals. The authors report that DGPO outperforms both prior-free RL and existing demonstration-based methods while preserving the training stability and online improvement benefits of on-policy PPO.

📝 Abstract

Large-scale GPU-parallel reinforcement learning has changed what can be trained in robot simulation, yet most systems still optimize one specialist policy per task. We propose a construction methodology for turning structured manipulation task families into GPU-parallel multi-task RL benchmarks, and instantiate it as MT-Libero using LIBERO assets and task predicates in Isaac Lab. The resulting benchmark supports simultaneous reinforcement learning over heterogeneous task suites with parallel rendering, physics randomization, and state-input or visual-input policies. To make such training practical under sparse success signals and limited prior data, we further propose DGPO, an on-policy demonstration-guided method that combines importance-weighted PPO with adaptive behavior cloning on matched demonstration actions. DGPO enables a tunable preference toward demonstrated task distributions, outperforming both prior-free RL and existing demonstration-based methods while preserving the stability and online improvement benefits of on-policy PPO.
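
The abstract does not give DGPO's objective, so the following is only a rough sketch of how a clipped, importance-weighted PPO term might be combined with a behavior cloning term on demonstration actions. Every name here (`clipped_ppo_term`, `adaptive_bc_weight`, `combined_loss`) is hypothetical, and the success-rate-based weighting schedule in particular is an assumption made for illustration, not the authors' method.

```python
import math


def clipped_ppo_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample (a quantity to maximize)."""
    # Importance weight pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def adaptive_bc_weight(success_rate, max_weight=1.0):
    """ASSUMED schedule: rely on demonstrations more while success is rare."""
    return max_weight * (1.0 - success_rate)


def combined_loss(rollout, demo_logps, success_rate,
                  clip_eps=0.2, max_bc_weight=1.0):
    """Scalar loss to minimize.

    rollout:    list of (logp_new, logp_old, advantage) from on-policy data.
    demo_logps: list of log pi(a_demo | s) under the current policy.
    """
    ppo_loss = -sum(
        clipped_ppo_term(n, o, a, clip_eps) for n, o, a in rollout
    ) / len(rollout)
    # Behavior cloning as negative log-likelihood of demonstration actions.
    bc_loss = -sum(demo_logps) / len(demo_logps)
    return ppo_loss + adaptive_bc_weight(success_rate, max_bc_weight) * bc_loss


if __name__ == "__main__":
    rollout = [(0.0, 0.0, 1.0), (-0.5, -0.7, -0.3)]
    demo_logps = [math.log(0.5), math.log(0.25)]
    print(combined_loss(rollout, demo_logps, success_rate=0.2))
```

In this sketch the behavior cloning weight decays to zero as the measured success rate approaches one, so the update reduces to plain PPO once a task is solved. That is one simple way to keep the on-policy improvement behavior the abstract emphasizes; the paper's actual adaptation rule and its matching of demonstration actions may differ.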

Problem

Research questions and friction points this paper is trying to address.

multi-task reinforcement learning

GPU-parallel training

sparse rewards

demonstration-guided learning

heterogeneous tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-parallel multi-task RL

demonstration-guided policy optimization

MT-Libero