🤖 AI Summary
Diffusion-based policies are difficult to fine-tune with reinforcement learning because each action is produced by a multi-step denoising process. This work proposes MODIP, a framework for offline-to-online fine-tuning that uses a world model to guide policy adaptation: model predictive control (MPC) generates high-quality trajectories inside the world model, and these serve as supervised targets, so the policy improves while retaining the simplicity and stability of behavior cloning. To speed up MPC planning, MODIP replaces the policy-dependent state-action value (Q-value) with a terminal state value, which reduces inference time, and it trains its critics with policy-independent temporal difference (TD) targets, which reduces training time. Evaluated on D4RL (MuJoCo, Kitchen) and RoboMimic tasks, MODIP improves diffusion policies beyond behavior cloning and is competitive with or outperforms existing diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
📝 Abstract
Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning remains challenging because actions are generated through a multi-step denoising process. In this work, we propose MODIP, a framework for the offline-to-online fine-tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high-quality trajectories within the WM, and use them as supervised targets for fine-tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy-dependent state-action value, reducing inference time. Additionally, MODIP trains critics with policy-independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
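To make the planning idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy 1-D world model, a hand-written state value function, and a random-shooting planner; the abstract does not say which MPC optimizer MODIP uses. All names (`world_model`, `state_value`, `score_plan`, `plan`, `td_target`, `collect_bc_targets`) are hypothetical. The sketch shows three things: a plan is scored by model rewards plus a terminal state value V(s_H), so scoring never queries the policy for an action; a TD target can be built from V(s') alone; and planner rollouts yield (state, action) pairs that could serve as supervised targets.

```python
import random

GAMMA = 0.99  # assumed discount factor


def world_model(state, action):
    """Hypothetical learned dynamics and reward: a 1-D point pushed toward 0."""
    next_state = state + 0.1 * action
    reward = -abs(next_state)
    return next_state, reward


def state_value(state):
    """Hypothetical learned terminal value V(s); it takes no action input."""
    return -10.0 * abs(state)


def score_plan(state, actions):
    """Discounted model rewards over the horizon plus a terminal V(s_H)."""
    total, discount = 0.0, 1.0
    for action in actions:
        state, reward = world_model(state, action)
        total += discount * reward
        discount *= GAMMA
    return total + discount * state_value(state)


def plan(state, horizon=5, num_candidates=256, rng=None):
    """Random-shooting MPC (an assumption): sample sequences, keep the best."""
    rng = rng or random.Random(0)
    candidates = [
        [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        for _ in range(num_candidates)
    ]
    return max(candidates, key=lambda actions: score_plan(state, actions))


def td_target(reward, next_state):
    """TD target built from V(s') alone, so no policy action is sampled."""
    return reward + GAMMA * state_value(next_state)


def collect_bc_targets(start_state, steps=10):
    """Roll out the planner in the model and record (state, action) pairs."""
    rng = random.Random(0)
    state, pairs = start_state, []
    for _ in range(steps):
        action = plan(state, rng=rng)[0]  # execute only the first planned action
        pairs.append((state, action))
        state, _ = world_model(state, action)
    return pairs
```

In a full system the recorded pairs would be passed to the diffusion policy's usual behavior cloning loss. That step is omitted here because it depends on the policy architecture.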