🤖 AI Summary
This work addresses two problems: behavior cloning is limited by demonstration coverage and distribution shift, and directly fine-tuning large-capacity generative policies with reinforcement learning is often unstable and sample-inefficient. The authors propose Lagrangian Perturbation Diffusion Steering (LP-DS), which keeps the generative policy frozen and learns a compact noise-space perturbation that is applied before decoding. The perturbation is optimized with a Lagrangian trust-region objective that improves downstream value while constraining deviation from the latent prior. LP-DS is presented as the first approach to bring a Lagrangian trust-region mechanism into latent noise space, balancing performance gains against the preservation of action-space entropy. It is evaluated with diffusion policies, flow-matching backbones, and a large vision-language-action model. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, task success, and return, with return improvements of up to 25% over prior baselines, while maintaining higher action-space entropy than unconstrained noise-space steering. It has also been deployed on a physical Franka robotic arm.
📝 Abstract
Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy by learning a compact noise-space perturbation before decoding. LP-DS optimizes this perturbation with a Lagrangian trust-region objective, improving downstream value while constraining deviation from the latent prior. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining higher action-space entropy than unconstrained noise-space steering, with return improvements of up to 25% over prior baselines. Additional evaluations with flow-matching backbones, a large vision-language-action model, and physical Franka deployment show that LP-DS is not limited to compact diffusion policies or simulated benchmarks. Project page: https://sites.google.com/view/lp-ds/home.
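The abstract does not spell out the objective, so the following is only a minimal sketch of what a Lagrangian trust-region update in noise space could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the frozen decoder is a fixed linear map `W`, the value function is a quadratic around a hypothetical target action `a_star`, the perturbation `delta` is a state-independent shift of unit-variance Gaussian noise, and deviation from the latent prior is measured as KL(N(delta, I) || N(0, I)) = 0.5 * ||delta||^2 with a budget `eps`.

```python
import numpy as np

# Toy stand-ins (assumptions, not taken from the paper).
W = np.array([[1.0, 0.5], [0.0, 1.0]])  # frozen linear "decoder": action = W @ noise
a_star = np.array([2.0, 1.0])           # hypothetical high-value action
eps = 0.5                               # trust-region budget on the KL to the prior


def expected_value(delta):
    """E_w[-||W (w + delta) - a_star||^2] for w ~ N(0, I), up to a delta-independent constant."""
    return -np.sum((W @ delta - a_star) ** 2)


def kl_to_prior(delta):
    """KL(N(delta, I) || N(0, I)) for a mean shift of unit-variance Gaussian noise."""
    return 0.5 * np.dot(delta, delta)


def steer(steps=5000, lr_delta=0.05, lr_lam=0.05):
    """Primal-dual ascent on L(delta, lam) = value(delta) - lam * (KL(delta) - eps)."""
    delta = np.zeros(2)  # start at the unperturbed prior
    lam = 1.0            # Lagrange multiplier, kept non-negative
    for _ in range(steps):
        # Primal step: raise the value, penalised by the current multiplier.
        grad_value = -2.0 * W.T @ (W @ delta - a_star)
        delta = delta + lr_delta * (grad_value - lam * delta)
        # Dual step: grow lam when the KL exceeds the budget, shrink it otherwise.
        lam = max(0.0, lam + lr_lam * (kl_to_prior(delta) - eps))
    return delta, lam


delta, lam = steer()
print("delta:", delta, "lambda:", lam)
print("KL to prior:", kl_to_prior(delta), "budget:", eps)
print("value before:", expected_value(np.zeros(2)), "after:", expected_value(delta))
```

In this toy setting the unconstrained optimum lies outside the trust region, so the multiplier settles at a positive value and the perturbation stops at the boundary of the budget. The decoder weights `W` are never updated, which mirrors the frozen-policy idea described in the abstract.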