Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

📅 2026-05-31

🤖 AI Summary
This work addresses two problems: behavioral cloning is limited by demonstration coverage and distribution shift, and directly fine-tuning large-capacity generative policies with reinforcement learning is often unstable and sample inefficient. The authors propose LP-DS, which keeps the generative policy frozen and learns a compact perturbation of its latent noise, applied before decoding. The perturbation is optimized with a Lagrangian trust-region objective that maximizes downstream value while constraining deviation from the latent prior. LP-DS is presented as the first approach to introduce a Lagrangian trust-region mechanism into latent noise space, balancing performance gains against preservation of action-space entropy. It is compatible with several architectures, including diffusion models, flow-matching models, and vision-language-action foundation models. Experiments on the RoboMimic, Gym, and Adroit benchmarks show improved sample efficiency, task success, and return, with return improvements of up to 25% over prior baselines, and the method has also been deployed on a physical Franka robotic arm.
📝 Abstract
Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy by learning a compact noise-space perturbation before decoding. LP-DS optimizes this perturbation with a Lagrangian trust-region objective, improving downstream value while constraining deviation from the latent prior. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining higher action-space entropy than unconstrained noise-space steering, with return improvements of up to 25% over prior baselines. Additional evaluations with flow-matching backbones, a large vision-language-action model, and physical Franka deployment show that LP-DS is not limited to compact diffusion policies or simulated benchmarks. Project page: https://sites.google.com/view/lp-ds/home.
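The abstract does not give the exact objective, so the following is only a minimal sketch of what a Lagrangian trust-region update over a noise-space perturbation could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the frozen policy is replaced by a hypothetical linear decoder `decode`, the value function is a toy quadratic `value`, the perturbation `delta` is a single learned shift of the latent noise, and deviation from the latent prior N(0, I) is measured by the KL divergence 0.5 * ||delta||^2 with a hypothetical budget `eps`. The perturbation takes gradient ascent steps on the Lagrangian while the multiplier `lam` takes dual ascent steps on the constraint violation.

```python
import numpy as np

def decode(W, z):
    """Toy stand-in for the frozen generative policy (assumption): latent noise -> action."""
    return z @ W.T

def value(actions, goal):
    """Toy value function (assumption): higher when actions are close to `goal`."""
    return -np.sum((actions - goal) ** 2, axis=-1)

def steer(W, goal, eps=0.5, steps=2000, lr=0.05, lr_dual=0.05, batch=256, seed=0):
    """Learn a latent shift `delta` under the constraint KL(N(delta, I) || N(0, I)) <= eps.

    The decoder weights W are never updated; only `delta` and the multiplier `lam` change.
    """
    rng = np.random.default_rng(seed)
    delta = np.zeros(W.shape[1])
    lam = 0.0
    for _ in range(steps):
        z = rng.standard_normal((batch, W.shape[1]))   # sample from the latent prior
        actions = decode(W, z + delta)                 # decode the perturbed noise
        # Batch-averaged gradient of value(decode(z + delta)) with respect to delta.
        grad_value = (-2.0 * (actions - goal) @ W).mean(axis=0)
        kl = 0.5 * float(delta @ delta)                # deviation from the latent prior
        # Primal step: ascend the Lagrangian  E[value] - lam * (kl - eps).
        delta = delta + lr * (grad_value - lam * delta)
        # Dual step: raise lam while the constraint is violated, keep it non-negative.
        lam = max(0.0, lam + lr_dual * (kl - eps))
    return delta, lam

if __name__ == "__main__":
    W = np.array([[1.0, 0.5], [0.0, 1.0]])   # hypothetical frozen decoder weights
    goal = np.array([2.0, -1.0])             # hypothetical high-value action
    delta, lam = steer(W, goal)
    print("delta:", delta, "lam:", lam, "kl:", 0.5 * float(delta @ delta))
```

In this toy setting the unconstrained optimum lies outside the trust region, so the multiplier grows until the shift settles near the boundary of the KL budget. A state-conditioned perturbation network and a learned critic would replace `delta` and `value` in a realistic setup.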
Problem

Research questions and friction points this paper is trying to address.

behavior cloning
distribution shift
reinforcement learning
sample efficiency
generative policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lagrangian Perturbation Diffusion Steering
latent reinforcement learning
noise-space perturbation
trust-region optimization
frozen generative policy