🤖 AI Summary
This work addresses the high learning complexity of existing diffusion-based visuomotor policies, which couple scene understanding and trajectory generation in the raw action space and therefore struggle with tasks requiring precise temporal coordination, such as multi-arm collaboration. To overcome this limitation, it proposes Latent Diffusion Policy (LDP), a two-stage framework in which an observation-conditioned conditional variational autoencoder (CVAE) absorbs scene understanding into a latent space, so that a flow-matching model generates trajectories under a smoother velocity field that is easier to learn. The approach decouples scene understanding from trajectory generation and introduces per-token diffusion forcing for training, staircase inference sampling, and rFID (reconstruction FID), a lightweight evaluation metric based on latent-space statistics. LDP outperforms DP3 by a substantial margin on coordination-intensive tasks in RoboTwin 2.0 and has been deployed on a real dual-arm robotic system.
📝 Abstract
Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into the encoder of an observation-conditioned conditional variational autoencoder (CVAE), LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.
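To make the idea of per-token diffusion forcing more concrete, the following is a minimal sketch of how a flow-matching training example with an independent time per latent token could be built. It is not the authors' implementation: the abstract does not specify the interpolation path or the time distribution, so the linear (rectified-flow style) path, the uniform time sampling, and the function name `per_token_flow_matching_pair` are all assumptions made for illustration. Staircase inference sampling is not shown because the abstract does not describe its schedule.

```python
import random


def per_token_flow_matching_pair(latent_tokens, rng):
    """Build one training example with an independent time per token.

    latent_tokens: list of tokens, each a list of floats (clean latents x1).
    Returns (times, noisy_tokens, target_velocities).

    Assumptions (not from the paper): a linear path
    x_t = (1 - t) * x0 + t * x1 and t drawn uniformly from [0, 1).
    """
    times, noisy_tokens, target_velocities = [], [], []
    for x1 in latent_tokens:
        t = rng.random()                          # independent time for this token
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise endpoint
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        velocity = [b - a for a, b in zip(x0, x1)]  # constant velocity of the linear path
        times.append(t)
        noisy_tokens.append(x_t)
        target_velocities.append(velocity)
    return times, noisy_tokens, target_velocities


rng = random.Random(0)
tokens = [[0.5, -1.0], [2.0, 0.0], [1.5, 1.5]]  # three hypothetical 2-D latent tokens
times, noisy, targets = per_token_flow_matching_pair(tokens, rng)
```

In this sketch a velocity model would receive `noisy` together with the per-token `times` and be regressed onto `targets`. Because each token carries its own time, tokens in the same sequence sit at different noise levels during training, which is the property that a sampling schedule at inference time would then have to match.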