MODIP: Efficient Model-Based Optimization for Diffusion Policies

📅 2026-06-09
🤖 AI Summary
Diffusion-based policies are slow and difficult to fine-tune with reinforcement learning because each action is produced by a multi-step denoising process. This work proposes MODIP, a framework for offline-to-online fine-tuning that uses a world model to guide the policy: model predictive control (MPC) generates high-quality trajectories inside the world model, and these serve as supervised targets, so the policy improves while retaining the stability of behavioral cloning. To speed up MPC planning, MODIP replaces the policy-dependent state-action value (Q-value) with a terminal state value, which reduces inference time. It also trains its critic with policy-independent temporal-difference (TD) targets, which reduces training time. On the D4RL (MuJoCo, Kitchen) and RoboMimic benchmarks, MODIP improves on standard behavioral cloning and is competitive with or outperforms existing diffusion-policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
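To illustrate what a policy-independent TD target could look like, here is a minimal sketch that assumes a one-step target of the form r + gamma * V(s') for a state-value critic. The function name `td_target_state_value`, the one-step form, and the toy numbers are assumptions made for illustration; the summary does not specify MODIP's exact target.

```python
import numpy as np

def td_target_state_value(rewards, next_values, dones, gamma=0.99):
    """One-step TD target r + gamma * V(s') for a state-value critic.

    Assumption: the critic is a state-value function V(s). The target only
    uses quantities available from a replay batch (reward, next state value,
    done flag), so no action has to be sampled from the diffusion policy.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    # Terminal transitions do not bootstrap from the next state.
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    return rewards + gamma * not_done * next_values

# Toy batch of three transitions (hypothetical numbers).
rewards = [1.0, 0.5, 2.0]
next_values = [10.0, 4.0, 7.0]   # V(s') as predicted by a target critic
dones = [0, 0, 1]
targets = td_target_state_value(rewards, next_values, dones, gamma=0.9)
print(targets)
```

By contrast, a state-action critic would typically bootstrap from Q(s', a') with a' drawn from the current policy, which for a diffusion policy means running the denoising chain for every target computation.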
📝 Abstract
Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning remains challenging because actions are generated through a multi-step denoising process. In this work, we propose MODIP, a framework for the offline-to-online fine-tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high-quality trajectories within the WM, and use them as supervised targets for fine-tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy-dependent state-action value, reducing inference time. Additionally, MODIP trains critics with policy-independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
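The planning step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a simple sampling-based planner (random shooting) as a stand-in, and the names `plan_with_terminal_value`, `toy_dynamics`, `toy_reward`, and `toy_value`, along with the 1-D toy world model, are all hypothetical. The point it shows is that each candidate action sequence is scored by its discounted model rewards plus a value of the terminal state, so no policy-dependent state-action value is queried during planning.

```python
import numpy as np

def plan_with_terminal_value(state, candidates, dynamics, reward, value, gamma=0.99):
    """Pick the best candidate action sequence inside a learned world model.

    candidates: array of shape (num_candidates, horizon, action_dim).
    Each sequence is rolled out with `dynamics`, scored by the discounted
    sum of `reward`, and closed off with gamma**horizon * value(s_H).
    """
    num, horizon, _ = candidates.shape
    states = np.repeat(state[None, :], num, axis=0)
    returns = np.zeros(num)
    discount = 1.0
    for t in range(horizon):
        actions = candidates[:, t]
        returns += discount * reward(states, actions)
        states = dynamics(states, actions)
        discount *= gamma
    # Terminal state value: depends on the final state only.
    returns += discount * value(states)
    best = int(np.argmax(returns))
    return candidates[best], returns[best]

# Hypothetical 1-D world model: the goal is to drive the state to zero.
def toy_dynamics(states, actions):
    return states + 0.1 * actions

def toy_reward(states, actions):
    return -np.sum(states ** 2, axis=1)

def toy_value(states):
    return -10.0 * np.sum(states ** 2, axis=1)

rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=(256, 5, 1))  # 256 sequences, horizon 5
best_actions, best_return = plan_with_terminal_value(
    np.array([1.0]), candidates, toy_dynamics, toy_reward, toy_value
)
print(best_actions.shape, best_return)
```

In a fine-tuning loop of the kind the abstract describes, the selected sequence `best_actions` would play the role of the supervised target for the diffusion policy, in place of a demonstration action in ordinary behavioral cloning.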
Problem

Research questions and friction points this paper is trying to address.

Diffusion Policies
Reinforcement Learning
Behavioral Cloning
Policy Fine-tuning
Robot Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Policies
Model-Based Optimization
Model Predictive Control
Offline-to-Online Fine-tuning
World Model